DETERMINING AND USING ACOUSTIC CONFUSABILITY, ACOUSTIC
PERPLEXITY AND SYNTHETIC ACOUSTIC WORD ERROR RATE



Cross Reference to Related Applications 

This application claims the benefit of United States Provisional Application 
Number 60/199,062, filed April 20, 2000. 

Field of the Invention 

The present invention relates to speech recognition systems and, more 
particularly, to the definition and efficient computation of the numerical quantities called 
acoustic confusability, acoustic perplexity, and synthetic word error rate, used in the 
creation and operation of speech recognition systems. 

Background of the Invention 

In the operation of a speech recognition system, some acoustic information
is acquired, and the system determines a word or word sequence that corresponds to the
acoustic information. The acoustic information is generally some representation of a
speech signal, such as the variations in voltage generated by a microphone. The output of
the system is the best guess that the system has of the text corresponding to the given
utterance, according to its principles of operation.

The principles applied to determine the best guess are those of probability
theory. Specifically, the system produces as output the most likely word or word sequence
corresponding to the given acoustic signal. Here, "most likely" is determined relative to
two probability models embedded in the system: an acoustic model and a language
model. Thus, if A represents the acoustic information acquired by the system, and W
represents a guess at the word sequence corresponding to this acoustic information, then
the system's best guess at the true word sequence is given by the solution of the
following equation:

$$ \hat{W} = \arg\max_{W} P(A \mid W)\, P(W). $$

Here P(A | W) is a number determined by the acoustic model for the system, and P(W) is a
number determined by the language model for the system. A general discussion of the
nature of acoustic models and language models can be found in "Statistical Methods for
Speech Recognition," Jelinek, The MIT Press, Cambridge, MA, 1999, the disclosure of
which is incorporated herein by reference. This general approach to speech recognition is
discussed in the paper by Bahl et al., "A Maximum Likelihood Approach to Continuous
Speech Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence,
Volume PAMI-5, pp. 179-190, March 1983, the disclosure of which is incorporated
herein by reference.
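
By way of illustration only, the decision rule above can be sketched in a few lines of Python. The candidate word sequences and their probabilities are hypothetical values chosen for the example, and `decode` is an illustrative helper rather than part of any disclosed system. Logarithms are used so that the product P(A | W) P(W) becomes a sum.

```python
import math

def decode(candidates, acoustic_logprob, lm_logprob):
    """Return the candidate W that maximizes log P(A | W) + log P(W)."""
    return max(candidates, key=lambda w: acoustic_logprob[w] + lm_logprob[w])

# Hypothetical scores for a single utterance A (illustrative values only).
acoustic_logprob = {
    "recognize speech": math.log(0.020),
    "wreck a nice beach": math.log(0.025),
}
lm_logprob = {
    "recognize speech": math.log(1e-3),
    "wreck a nice beach": math.log(1e-4),
}

best_guess = decode(list(acoustic_logprob), acoustic_logprob, lm_logprob)
print(best_guess)
```

In this sketch the acoustic model slightly prefers the second candidate, but the language model probability outweighs that preference.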

The acoustic and language models play a central role in the operation of a 
speech recognition system: the higher the quality of each model, the more accurate the 
recognition system. A frequently-used measure of quality of a language model is a 
statistic known as the perplexity, as discussed in section 8.3 of Jelinek. For clarity, this 
statistic will hereafter be referred to as "lexical perplexity." It is a general operating 
assumption within the field that the lower the value of the lexical perplexity, on a given 
fixed test corpus of words, the better the quality of the language model. 

However, experience shows that lexical perplexity can decrease while 
errors in decoding words increase. For instance, see Clarkson et al., "The Applicability of 
Adaptive Language Modeling for the Broadcast News Task," Proceedings of the Fifth 
International Conference on Spoken Language Processing, Sydney, Australia, November
1998, the disclosure of which is incorporated herein by reference. Thus, lexical perplexity 
is actually a poor indicator of language model effectiveness. 

Nevertheless, lexical perplexity continues to be used as the objective
function for the training of language models, when such models are determined by
varying the values of sets of adjustable parameters. What is needed is a better statistic for
measuring the quality of language models, and hence for use as the objective function
during training.



Summary of the Invention

The present invention solves these problems by providing two statistics
that are better than lexical perplexity for determining the quality of language models.
These statistics, called acoustic perplexity and the synthetic acoustic word error rate
(SAWER), in turn depend upon methods for computing the acoustic confusability of
words. Some methods and apparatuses disclosed herein substitute models of acoustic data
in place of real acoustic data in order to determine confusability.

In a first aspect of the invention, two word pronunciations l(w) and l(x) are
chosen from all pronunciations of all words in fixed vocabulary V of the speech
recognition system. It is the confusability of these pronunciations that is desired. To
determine it, an evaluation model (also called valuation model) of l(x) is created, a synthesizer
model of l(w) is created, and a matrix is determined from the evaluation and synthesizer
models. Each of the evaluation and synthesizer models is preferably a hidden Markov
model. The synthesizer model preferably replaces real acoustic data. Once the matrix is
determined, a confusability calculation may be performed. This confusability calculation
is preferably performed by reducing an infinite series of multiplications and additions to a
finite matrix inversion calculation. In this manner, an exact confusability calculation may
be determined for the evaluation and synthesizer models.
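
The reduction of an infinite series to a matrix inversion rests on the standard identity I + M + M^2 + ... = (I - M)^{-1}, which holds when the powers of M tend to zero. The following Python sketch checks this identity numerically on a small hypothetical matrix M. The matrix is not the matrix of the disclosure, and the helper names are assumptions made for illustration.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mat_sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def invert(m):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [list(row) + identity(n)[i] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * pv for v, pv in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def truncated_series(m, terms):
    """Sum I + M + M^2 + ... + M^(terms - 1)."""
    total = identity(len(m))
    power = identity(len(m))
    for _ in range(terms - 1):
        power = mat_mul(power, m)
        total = mat_add(total, power)
    return total

# Hypothetical matrix whose powers shrink toward zero (illustrative values only).
M = [[0.5, 0.3], [0.0, 0.6]]
closed_form = invert(mat_sub(identity(2), M))
series = truncated_series(M, 200)
```

The single inversion gives the value that the series only approaches term by term, which is the sense in which the calculation is exact.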

In additional aspects of the invention, different methods are used to
determine certain numerical quantities, defined below, called synthetic likelihoods. In
other aspects of the invention, (i) the confusability may be normalized and smoothed to
better deal with very small probabilities and the sharpness of the distribution, and (ii)
methods are disclosed that increase the speed of performing the matrix inversion and the
confusability calculation. Moreover, a method for caching and reusing computations for
similar words is disclosed.

Still more aspects of the invention determine and apply acoustic perplexity 
and the synthetic acoustic word error rate (SAWER). Acoustic perplexity and SAWER 
are better determinants of the effectiveness of language models than is lexical perplexity.

A more complete understanding of the present invention, as well as further 
features and advantages of the present invention, will be obtained by reference to the 
following detailed description and drawings. 

Brief Description of the Drawings

FIG. 1 is a flow chart describing an exemplary process that determines and
uses confusability, in accordance with one embodiment of the present invention;

FIG. 2 is a diagram of an exemplary speech portion;

FIG. 3 is a diagram of evaluation and synthesizer models, in accordance
with one embodiment of the present invention;

FIG. 4 shows an evaluation model, a synthesizer model, and a product
machine in accordance with one embodiment of the present invention;

FIG. 5 shows the product machine of FIG. 4;

FIG. 6 shows a probability flow matrix determined from the product
machines of FIGS. 4 and 5, in accordance with one embodiment of the present invention;

FIG. 7 is a diagram of a trellis of H_{w|x}, in accordance with one embodiment
of the present invention;

FIG. 8 shows an evaluation model, a synthesizer model, and a product
machine in accordance with one embodiment of the present invention;

FIG. 9 shows a matrix determined from the product machine of FIG. 8 in
accordance with one embodiment of the present invention; and

FIG. 10 is a block diagram of a system suitable to carry out embodiments
of the present invention.

Detailed Description of Preferred Embodiments 

The present invention provides two statistics that are better than lexical
perplexity for determining the quality of language models. These statistics, called
acoustic perplexity and the synthetic acoustic word error rate (SAWER), in turn depend
upon methods for computing the acoustic confusability of words. There are multiple
methods for computing acoustic confusability, but two particular methods will be
discussed herein. In one method of determining confusability, a model of acoustic data is
substituted in place of real acoustic data in order to determine confusability. In another
method, the edit distance between phoneme sequences is used to determine confusability.
Confusability as determined by the first method depends on terms entitled "synthetic
likelihoods," and several different techniques are disclosed herein that determine these
likelihoods.

Acoustic perplexity and SAWER have multiple applications. Importantly,
these may be applied to determine the quality of language models and during language
model generation to improve language models.

(1) ORGANIZATION

This disclosure is arranged as follows. In Section (2), a short introduction
to a method for determining and using confusability is given. In Section (3), acoustic
perplexity is introduced. This is related and compared to the concept of lexical perplexity,
which statistic the present invention supersedes. In section (3)(a), it is shown how
acoustic perplexity may be determined, in a way analogous to lexical perplexity. How the
language model probabilities enter into the definition of acoustic perplexity is also
shown. A quantity known as the acoustic encoding probability is defined, which is used
in determining acoustic perplexity.

In section (3)(b), the computation of acoustic encoding probability is
developed. This development proceeds through sections (3)(c) through (3)(f), which
sections also relate the acoustic encoding probability to acoustic confusability. In section
(3)(c), the concept of multiple lexemes or pronunciations is discussed. Next, also in
section (3)(c), the modeling of sequences of real acoustic observations is discussed,
followed by a discussion of replacement of real observations with synthetic observations.

In section (3)(d), the development of a method for computing acoustic
confusability between models is begun; in this section, confusability of densities is also
discussed. This development proceeds by the following steps. In section (3)(e), the
interaction between paths and densities is shown and discussed. In section (3)(f), an
algorithm for computing confusability of hidden Markov models is disclosed. This
method, as shown in section (3)(f), comprises the following: (i) constructing a product
machine; (ii) defining the transition probabilities and synthetic likelihoods of the arcs and
states of this product machine; and (iii) determining a probability flow matrix and relating
the result of a certain matrix computation to acoustic confusability.

In section (4), the disclosure returns to the definition of acoustic
perplexity. Next, in section (5), SAWER is disclosed. The methods of section (3) are
expanded to multiple lexemes (section (6)) and continuous speech (section (7)). Section
(8) discusses using acoustic perplexity and SAWER during training of language models.
This section also discusses minimization of acoustic perplexity and SAWER, and
discloses additional computational refinements.

Section (9) discloses general computational enhancements when
determining acoustic confusability. In section (9)(a), it is shown how the matrix
computation, performed when determining confusability, may be performed efficiently.
Section (9)(b) discusses additional efficiencies by caching, while section (9)(c) discusses
thresholding.

Section (10) discloses additional applications of acoustic perplexity and
SAWER. In section (10)(a), a method of vocabulary selection is disclosed. Section
(10)(b) explains selection of trigrams and maxent model features. Section (11) explains
alternative, approximate techniques to compute confusability, which are based on edit
distances between phoneme sequences. Finally, section (12) discusses an exemplary
system for carrying out embodiments of the present invention.

(2) EXEMPLARY METHOD FOR DETERMINING CONFUSABILITY,
ACOUSTIC PERPLEXITY AND SAWER

Now that the organization of this disclosure has been discussed, it is
beneficial to present a simple method for determining and using acoustic confusability
(or, more simply, "confusability"). FIG. 1 shows a flow chart of an exemplary method
100 for determining confusability, acoustic perplexity, and SAWER. Method 100
illustrates one particular technique used to determine confusability. Because the
determination of confusability, acoustic perplexity and SAWER is mathematically
complex for those not skilled in the art, FIG. 1 is introduced at this point to give an
overview of method 100. Each step in method 100 is described below in more detail.

Method 100 begins when evaluation and synthesizer models are created in
step 110. As is known in the art and discussed in greater detail below, hidden Markov
models are used to statistically model acoustic pronunciations of words. The exemplary
evaluation and synthesizer models are therefore hidden Markov models.

In step 120, a product machine is created from the evaluation and
synthesizer models. This is explained in greater detail below. The product machine is
used to create a probability flow matrix (step 130). Entries of the probability flow matrix
depend on synthetic likelihoods, which are determined in step 140. After entries of the
probability flow matrix have been determined, confusability is determined (step 150).
Actual confusability values can be very small values. Consequently, the confusability is
normalized. Additionally, the confusability is smoothed to correct for resultant
probabilities that are too sharp. Normalization and smoothing occur in step 160.

Once the confusability has been determined, smoothed, and normalized,
the acoustic perplexity and/or SAWER may be determined in steps 170 and 180,
respectively. Both acoustic perplexity and SAWER are superior to lexical perplexity as
predictors of language model performance. Thus determination of these quantities, and
using them as objective functions when adjusting language model parameters, improves
language model generation.

It is beneficial at this juncture to start with a description of some basic
equations that concern speech recognition. Moreover, it is also beneficial to relate
confusability with acoustic perplexity, as one application of confusability is that it is used
to determine acoustic perplexity. The following technique applies to full words, and
context may also be taken into account. Here "context" means that the details of
pronunciation of each of the words or phrases whose confusability is being computed
depend in some way on one or more adjacent words. However, some simplifications
help make the following method more understandable. These simplifications assume that
each of the evaluation and synthesizer models comprises one word that is a single
phoneme long, and context is ignored. These simplifications are used to make the
drawings easier to understand, but the method itself is applicable to large words and/or
phrases, and to cases where context is taken into account, in the sense given above.

(3) CONFUSABILITY 

Let P_θ be a language model, where P_θ(w_0 ... w_s) is the probability that
the model assigns to word sequence w_0 ... w_s, and θ is a (typically very large) set of
parameters, which determines the numerical value of this probability. One widely-used
method of assigning values to the elements of θ is to obtain a corpus C equal to w_0 ... w_{|C|-1}
of naturally generated text, also very large, and to adjust θ to maximize P_θ(C), the
modeled probability of the corpus. This is an instance of the well-established principle of
maximum-likelihood estimation. The model is made to accord as closely as possible with
a collection of examples, in hopes that it will function well as a predictor in the future.
More formally, the adjustable parameters of the model are set to values that maximize its
assignment of probability to instances of the phenomenon being modeled. Within the
speech recognition field, this adjusting process is commonly called "training."



Since the aim of modeling is to assign high probability to events that are
known to occur, while assigning little or no probability to those that do not, this approach
is reasonable. But this approach leads to the assumption that the higher the value P_θ(C)
attains, or possibly the same statistic P_θ(T), computed on an independent test corpus T,
the better the model must be. Because the quantity

$$ Y_L(P_\theta, C) = P_\theta(C)^{-1/|C|}, \tag{1} $$

called the perplexity (and herein called the lexical perplexity), is inversely related to the
raw likelihood P_θ(C), the assumption has arisen that the lower the perplexity, the better.
Perplexity is commonly reported in papers on language modeling. However, it is known
from experience that lexical perplexity is a poor indicator of the performance of a
language model. For instance, it can be shown that, as lexical perplexity decreases, the
error rate of decoding can actually increase.



(3)(a) Acoustic Perplexity 

As an analog of lexical perplexity, the quantity

$$ Y_A(P_\theta, C, A) = P_\theta(C \mid A)^{-1/|C|} \tag{2} $$

is defined as the acoustic perplexity of the model P_θ, evaluated on lexical corpus C and
its acoustic realization A. In the same way as P_θ is decomposed into a product of
individual-word probabilities, for use in computing Y_L, so too may P_θ(C | A) be
decomposed.

To express this decomposition, the following notational conventions are
adopted. The word that is being decoded, at the current position i of a textual corpus, is
w_i. Its acoustic realization, which is the sound of someone speaking this word, is written
a(w_i). The sequence of all words preceding w_i, which is w_0, w_1, ..., w_{i-1}, is denoted h_i, and
its acoustic realization is a(h_i). Likewise, the sequence of all words following w_i is written
r_i, with acoustics denoted a(r_i). Here, the letter r is used to suggest "right context,"
meaning the words occurring in the future. In general, "h_i" is a series of words that occurs
in the past, "w_i" is a word that occurs in the present (at time i), and "r_i" is a series of
words that occurs in the future.

Referring to FIG. 2, a representation 200 of an acoustic event is shown.
Representation 200 illustrates the notational conventions discussed in the preceding
paragraph. The acoustic event is someone saying the phrase "The cat jumped onto the
wall." Representation 200 is an exemplary analog signal of the acoustic event that is
obtained, for instance, with a microphone. In the sentence, "The cat jumped onto the
wall," h_i represents "The cat jumped onto" and any other preceding words, w_i
represents "the," and r_i represents "wall" and any other subsequent words. In this context,
h_i, w_i, and r_i represent the words being spoken as abstract linguistic tokens. The variable
a(h_i w_i r_i) represents the acoustics of those words. In other words, a(h_i w_i r_i) represents
how a person actually says the sentence, "The cat jumped onto the wall." Similarly,
acoustics a(h_i) represents a person saying "The cat jumped onto;" acoustics a(w_i)
represents a person saying "the;" and acoustics a(r_i) represents a person saying "wall."
Note that h_i is the word history at position i, and comprises all words up to but not
including the current word w_i.

Generally, the acoustic event a(w_i) is a radially expanding pressure wave,
emanating from the mouth of a speaker when the word w_i is pronounced. The pressure
wave is transduced to an analog electrical signal (shown as representation 200), which is
then quantized in time and amplitude. The electrical signal is ultimately processed into a
finite sequence ⟨a_w^0 ... a_w^{T-1}⟩, also written ⟨a_w^i⟩, where i is understood to run from i = 0 to
some limit T - 1, of d-dimensional feature vectors. The list ⟨a_w^i⟩ constitutes the
observation sequence. More specifically, most speech recognition systems operate as
follows. The sequence of time- and amplitude-quantized samples is separated into a series
of overlapping blocks of samples. Each block, which is also called a frame, is then
processed by well-known techniques to yield a cepstral vector. Typically a new cepstral
vector is generated every 10 milliseconds (ms); many systems use a cepstral vector of 13
dimensions. The first and second differences of these cepstral vectors are then typically
concatenated to each original 13-dimensional vector, yielding a series of 39-dimensional
vectors. For such a system, the variable d mentioned above attains the value 39.
Therefore, every 10 ms, another feature vector having 39 dimensions will be created. The
acoustic event a(w) constitutes a contiguous sequence of these d-dimensional feature
vectors. A speech recognition system attempts to analyze these feature vectors and, from
them, determine the word w. In the notation just given, ⟨a_w^i⟩ is precisely this sequence of
contiguous feature vectors.
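
The construction of 39-dimensional vectors from 13-dimensional cepstral vectors can be sketched as follows. This is a minimal illustration that assumes plain frame-to-frame differences, with the first frame differenced against itself; practical systems may compute differences over a wider window. The function names and the toy cepstral values are hypothetical.

```python
def differences(frames):
    """First difference along time; the first frame is differenced against itself."""
    previous = [frames[0]] + frames[:-1]
    return [[c - p for c, p in zip(cur, prev)] for cur, prev in zip(frames, previous)]

def add_differences(cepstra):
    """Concatenate each cepstral vector with its first and second differences."""
    first = differences(cepstra)
    second = differences(first)
    return [c + d1 + d2 for c, d1, d2 in zip(cepstra, first, second)]

# Five hypothetical 13-dimensional cepstral vectors, one per 10 ms frame.
cepstra = [[float(t * k) for k in range(13)] for t in range(5)]
features = add_differences(cepstra)
print(len(features), len(features[0]))
```

Each frame keeps its own 13 cepstral values in positions 0 to 12, followed by 13 first differences and 13 second differences, so d = 39.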

By elementary probability theory, the decomposition of P_θ(C) may be
expressed as follows:

$$ P_\theta(C) = P_\theta(w_0 w_1 \ldots w_{|C|-1}) = \prod_{i \in C} p_\theta(w_i \mid h_i). \tag{3} $$

For this reason, P_θ is identified with the family of conditional distributions
{p_θ(w | h)} that underlies it. The family {p_θ(w | h)} and the general model P_θ are
spoken of interchangeably in the singular as "a language model."
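
As a brief illustration of how the product of per-word conditional probabilities feeds the lexical perplexity P_θ(C)^(-1/|C|), the following Python sketch computes the statistic in the log domain. The probabilities are hypothetical, and `lexical_perplexity` is an illustrative name rather than a function of any particular toolkit.

```python
import math

def lexical_perplexity(word_probs):
    """Compute P(C) ** (-1 / |C|), where P(C) is the product of word_probs.

    word_probs[i] stands for the language model probability p(w_i | h_i).
    """
    log_p_corpus = sum(math.log(p) for p in word_probs)
    return math.exp(-log_p_corpus / len(word_probs))

# Hypothetical four-word corpus in which every word has probability 0.25.
perplexity = lexical_perplexity([0.25, 0.25, 0.25, 0.25])
print(perplexity)
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on corpora of realistic size.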

Applying this to the decomposition of P_θ(C | A), one may write
C = w_0 w_1 ... w_{|C|-1} and A = a(w_0 w_1 ... w_{|C|-1}). By the rules of conditional
probability, it follows that:

\begin{align}
P_\theta(C \mid A) &= \prod_{i \in C} p_\theta(w_i \mid h_i\, a(w_0 w_1 \ldots w_{|C|-1})) \tag{4} \\
&= \prod_{i \in C} p_\theta(w_i \mid h_i\, a(h_i w_i r_i)), \tag{5}
\end{align}


where the second line is a purely notational variant of the first. (This is so because even
as i varies, a(h_i w_i r_i) continues to denote the entire acoustic signal.) This expression is
appropriate for recognition of continuous speech. For recognition of discrete speech,
where each word is unmistakably surrounded by silence, the simpler form shown below
occurs:

$$ P_\theta(C \mid A) = \prod_{i \in C} p_\theta(w_i \mid h_i\, a(w_i)). \tag{6} $$



Next, how the language model probability enters explicitly into this
expression is shown. Consider any one factor in equation (5). Suppressing the i subscript
for readability, by Bayes' theorem it may be written:

$$ p_\theta(w \mid h\, a(h\, w\, r)) = \frac{p(a(h\, w\, r) \mid w\, h)\; p_\theta(w \mid h)}{\sum_{x \in V} p(a(h\, w\, r) \mid x\, h)\; p_\theta(x \mid h)}, \tag{7} $$

where V is the vocabulary of all words recognized by the speech recognition system. Here
p_θ(w | h) and p_θ(x | h) are regular language model probabilities. For the case of discrete
speech, this expression takes the form:

$$ p_\theta(w \mid h\, a(w)) = \frac{p(a(w) \mid w\, h)\; p_\theta(w \mid h)}{\sum_{x \in V} p(a(w) \mid x\, h)\; p_\theta(x \mid h)}. \tag{8} $$

This disclosure will refer to the family of conditional distributions {p(a(h w r) | x h)},
which arises in the continuous case, or the family {p(a(w) | x h)}, which arises in the
discrete case, as an acoustic encoding model. An individual value p(a(h w r) | x h) in the
continuous case, or p(a(w) | x h) in the discrete case, is called an acoustic encoding
probability (or, more precisely, the acoustic encoding probability of x as a(h w r) or a(w)
in context h). The ordering of conditioning information as x h or h x is immaterial here,
and we will interchange them freely.
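
The discrete-speech posterior can be illustrated with a short Python sketch that combines language model probabilities with acoustic encoding probabilities over a small vocabulary. The three-word vocabulary and all numerical values are hypothetical, and `posterior` is an illustrative helper.

```python
def posterior(word, lm_prob, encoding_prob):
    """Posterior of `word` given history h and acoustics a(w), by Bayes' theorem.

    lm_prob[x]       stands for the language model probability p(x | h).
    encoding_prob[x] stands for the acoustic encoding probability p(a(w) | x h).
    """
    evidence = sum(encoding_prob[x] * lm_prob[x] for x in lm_prob)
    return encoding_prob[word] * lm_prob[word] / evidence

# Hypothetical vocabulary V and scores for one spoken instance of "the".
lm_prob = {"the": 0.5, "thee": 0.1, "cat": 0.4}
encoding_prob = {"the": 0.6, "thee": 0.3, "cat": 0.01}

p_the = posterior("the", lm_prob, encoding_prob)
print(p_the)
```

The denominator sums over every word x in the vocabulary, so a word that is acoustically confusable with the spoken word (here "thee") takes probability mass away from the correct word.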

Thus, the acoustic encoding probability should be computed to determine
acoustic perplexity. Under appropriate assumptions, the continuous-case acoustic
encoding probability may be approximated as:

$$ p(a(h\, w\, r) \mid h\, x) \approx p(a(h) \mid h)\; p(a(w) \mid h\, x)\; p(a(r) \mid h). \tag{9} $$

Several assumptions were made to simplify the acoustic encoding
probability as shown in equation (9). First, it was assumed that a(h w r) = a(h) a(w) a(r).
This assumes independence of these acoustic events. In reality, these will generally not be
independent, as the tongue and other vocal equipment do not change instantaneously to
be able to transition from one word to the next. Second, this also assumes that a(h) and
a(r) do not depend on the word x.

Thus, under these assumptions, the value p(a(w) | h x) is now the quantity
that needs to be determined, in either the continuous or discrete cases. This disclosure
will treat the continuous case in more detail later.

(3)(b) Computing Acoustic Encoding Probability

Thus, the aim is to provide a working definition of an acoustic encoding
probability, in the discrete case written p(a(w) | h x). The term p(a(w) | h x) is defined
herein as the acoustic confusability of w with respect to x. For simplicity, the context h
may be ignored for the moment and just p(a(w) | x) considered. Here a(w) is an acoustic
event, which is to say a signal that exists in the physical world. By comparison x is an
element of an abstract space of words, drawn from a finite set V, the vocabulary. This x is
just a placeholder, to determine which model to use when computing p(a(w) | x). Note
that p(a(w) | x) is nominally the probability of observing acoustic signal a(w), given that x
was the word actually being pronounced.

(3)(c) Lexemes and Replacement of Real Acoustic Observations with
Synthetic Acoustic Observations

If there were only one single model p(· | x) for word x (that is, one single
model for evaluating acoustic event probabilities, on the assumption that x was the word
being pronounced), then p(a(w) | x) would be the probability that this model assigns to
the observation a(w). But, in general, a given word x has many pronunciations. For
instance, the word "the" may be pronounced "TH UH" or "TH IY," wherein "TH,"
"UH," and "IY" are phonemes (also called "phones"). Phonemes are small,
indivisible acoustic elements of speech; here UH represents a schwa sound, whereas IY
represents a long e vowel. There are about 51 phonemes in English, although some
speech recognition systems might have more or fewer phonemes. These pronunciations are
referred to as lexemes or baseforms, and the following may be written:

$$ x = \{\, l^1(x),\; l^2(x),\; \ldots,\; l^{n_x}(x) \,\} \tag{10} $$

to indicate that a word x admits of multiple pronunciations l^1(x), l^2(x) and so on. Here
n_x is the number of distinct pronunciations recognized for x and each l^i(x) is one lexeme.
Carrying this notation a little further, it is possible to write l(x) ∈ x for an arbitrary lexeme
l(x) associated with the word x, and Σ_{l(x)∈x} for a sum in which l(x) varies over the lexeme
set for x.

Using these known pronunciations, the middle factor of Equation 9 becomes:

$$ p(a(w) \mid h\, x) = \sum_{l(x) \in x} p(a(w) \mid h\, x\, l(x))\; p(l(x) \mid h\, x). \tag{11} $$

In this equation, each pronunciation l(x) is a sequence of phones. The
quantity p(a(w) | h x l(x)) is the probability of observing the acoustic signal a(w), given
the history/context h, the word x, and the pronunciation of the word x as l(x). Since the
pronunciation l(x) is uniquely associated with word x, listing x as part of the conditioning
information as well is unnecessary. Thus, the quantity p(a(w) | h x l(x)) may be written
p(a(w) | h l(x)) with no loss of generality.
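
The sum over lexemes can be illustrated with a minimal Python sketch. The two pronunciations of "the" follow the example given earlier, but the per-lexeme acoustic probabilities and pronunciation priors below are hypothetical values, and `encoding_probability` is an illustrative name.

```python
def encoding_probability(lexeme_scores):
    """Sum over lexemes l(x) of p(a(w) | h l(x)) * p(l(x) | h x).

    lexeme_scores maps a lexeme to a pair
    (acoustic probability p(a(w) | h l(x)), pronunciation prior p(l(x) | h x)).
    """
    return sum(acoustic * prior for acoustic, prior in lexeme_scores.values())

# Hypothetical scores for the two lexemes of the word "the".
scores_for_the = {
    "TH UH": (0.020, 0.7),
    "TH IY": (0.005, 0.3),
}
p_total = encoding_probability(scores_for_the)
print(p_total)
```

The pronunciation priors for a given word are assumed to sum to one, so the result is a weighted average of the per-lexeme acoustic probabilities.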

It is assumed that the model p(· | l(x) h) is a continuous-density hidden
Markov model. Hidden Markov models are described in Jelinek, "Statistical Methods for
Speech Recognition," The MIT Press, Cambridge, MA, 1997, the disclosure of which is
incorporated herein by reference. Such a hidden Markov model consists of a set of states
Q with identified initial and final states, a probability density function δ_{qq'} for each
allowed state-to-state transition, and a matrix τ of transition probabilities. The likelihood
of a sequence of observations p(⟨a_w^i⟩ | l(x) h) is then taken as the sum over all paths
from the initial to final state of the joint probability of a given path and the associated
sequence of observation probabilities. Since such a model serves to evaluate the
likelihood of an observation sequence, it is referred to as an evaluation model.

This likelihood would be the number sought if a joint corpus ⟨C, A⟩ (the
corpus of words that have equivalent acoustic events associated with them) were
sufficiently large. With a large enough joint corpus, one could simply determine from the
joint corpus the likelihood of any observation sequence, on the assumption that each word
of the vocabulary was pronounced a very large number of times. However, the typical
joint corpus ⟨C, A⟩ is very small as compared to a regular language model corpus. For
instance, a large joint corpus is generally 1000 times smaller than a typical language
model corpus. Moreover, the sequence ⟨a_w^i⟩ constitutes one single pronunciation of the
word w. This could be the sole instance, or one of the few instances, of the word in the
entire corpus. It would be misleading, when estimating word confusabilities, to rely upon
so little data. What would be ideal is a large number of such instances of pronunciations,
all pronounced in the same context h, but also comprising a large number of distinct
contexts. Unfortunately, no such acoustic corpus exists.
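
The likelihood that an evaluation model assigns to an observation sequence, described above as a sum over all paths from the initial to the final state, can be computed without enumerating paths by the standard forward recursion. The following Python sketch assumes one-dimensional Gaussian densities attached to the transitions and a hypothetical three-state model; the names and parameter values are illustrative only.

```python
import math

def gaussian(mean, variance):
    """Return a one-dimensional Gaussian density function."""
    def density(x):
        return (math.exp(-(x - mean) ** 2 / (2.0 * variance))
                / math.sqrt(2.0 * math.pi * variance))
    return density

def likelihood(observations, arcs, initial, final):
    """Sum over all initial-to-final paths of the joint path and observation probability.

    arcs maps (state, next_state) to (transition probability, density on that arc).
    """
    alpha = {initial: 1.0}  # probability mass per state before any observation
    for obs in observations:
        step = {}
        for (state, next_state), (tau, density) in arcs.items():
            if state in alpha:
                step[next_state] = (step.get(next_state, 0.0)
                                    + alpha[state] * tau * density(obs))
        alpha = step
    return alpha.get(final, 0.0)

# Hypothetical left-to-right model: state 0 is initial, state 2 is final.
g_a, g_b = gaussian(0.0, 1.0), gaussian(1.0, 1.0)
arcs = {(0, 0): (0.5, g_a), (0, 1): (0.5, g_a),
        (1, 1): (0.3, g_b), (1, 2): (0.7, g_b)}
observations = [0.1, 0.4, 1.2]
total = likelihood(observations, arcs, initial=0, final=2)
print(total)
```

For this three-observation sequence exactly two paths reach the final state (0-0-1-2 and 0-1-1-2), and the recursion returns the sum of their joint probabilities.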

For this reason, the strategy of synthesizing observation sequences
corresponding to a(w) is adopted. To synthesize observation sequences, the same type of
model that is employed to evaluate the likelihood of an observation sequence is used as a
generative model for a(w). However, this model operates with its own densities and
transition probabilities to generate data points. Though it preferably has exactly the same
form as an evaluation model, this model will be referred to as a synthesizer model. The
synthesizer and evaluation models are described in more detail in reference to FIG. 6.
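
To make the generative use of such a model concrete, the following Python sketch draws one synthetic observation sequence from a small hidden Markov model by repeatedly choosing an outgoing transition and sampling from the Gaussian density attached to it. The model, its parameters, and the function name `synthesize` are hypothetical. The sketch illustrates sampling only; it is not the confusability computation of the disclosure.

```python
import random

def synthesize(arcs, initial, final, rng):
    """Draw one observation sequence by walking from the initial to the final state.

    arcs maps a state to a list of
    (next_state, transition probability, density mean, density standard deviation).
    """
    state, observations = initial, []
    while state != final:
        choices = arcs[state]
        weights = [choice[1] for choice in choices]
        next_state, _, mean, std = rng.choices(choices, weights=weights)[0]
        observations.append(rng.gauss(mean, std))
        state = next_state
    return observations

# Hypothetical synthesizer model: state 0 is initial, state 2 is final.
model = {
    0: [(0, 0.5, 0.0, 1.0), (1, 0.5, 0.0, 1.0)],
    1: [(1, 0.3, 1.0, 1.0), (2, 0.7, 1.0, 1.0)],
}
sample = synthesize(model, initial=0, final=2, rng=random.Random(0))
print(len(sample))
```

Every path through this model takes at least two transitions, so each synthetic sequence contains at least two observations.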

Thus, a true acoustic realization a(w) is replaced by a model â(w).
Therefore, p(a(w) | h l(x)) becomes

$$ p(a(w) \mid h\, l(x)) \approx p(\hat{a}(w) \mid h\, l(x)). \tag{12} $$

There are some additional assumptions that may be made to simplify
Equation 12. While these assumptions are not necessary, they do act as an aid to
understanding the invention. Some exemplary assumptions are as follows. If the word w
is assumed to have only one pronunciation and both the word w and the word x are
considered to consist of only one phone, simple models will result for each of the
synthesizer and evaluation models. (Arbitrarily long words or phrases may be treated by
this method, simply by stringing together individual phone models, one after the other.)
These simple models are shown in more detail in FIG. 3. Mathematically, for the phone
W, which is a representation of â(w), and the phone X, which is a representation of l(x),
and if context h is ignored, Equation 12 becomes:

$$ p(\hat{a}(w) \mid h\, l(x)) \approx p(W \mid X). \tag{13} $$

Methods for estimating the right hand side of equation 13 are discussed
below. One method involves determining a probability flow matrix and using the
probability flow matrix to determine acoustic confusability. Prior to determining the
elements of the probability flow matrix, synthetic likelihoods are determined. Synthetic
likelihoods are measures of how confusable two given densities are. More specifically,
synthetic likelihoods are measures of the similarity or confusability of two specific output
or observation densities associated with two hidden Markov models. There are various
methods for determining synthetic likelihoods, which are described below.

f3)(d) Confusability of Densities 

Synthetic likelihoods arise fi-om the problem of synthesizing data 
according to one probability density and evaluating it according to another. The following 
four techniques to determine synthetic likelihoods are discussed herein: the cross-entropy 
25 measure; the dominance measure; the decoder measure; and the empirical measure. Each 
of these may be compressed through the techniques of compression by normalization or 
compression by ranks. The compressed synthetic hkelihoods are the k values that appear 
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throughout the drawings and that will most likely be used in the definitions of acoustic 
confusabiUty, acoustic perplexity, and synthetic acoustic word error rate. 

(3)(d)(1) The Cross-Entropy Measure

First, consider a radically simplified version of this computation. Suppose that for every
acoustic event $a(w)$, the associated sequence $\langle a_w^t \rangle$ has length 1, and that the
dimension $d$ of this single vector is also 1. In other words, $a(w)$ is identified with a
single real number $a_w$. Likewise, suppose that the valuation model $p(\cdot \mid l(x)\, h)$ has a
single transition, with associated density $\delta_{l(x) h}$, hereafter abbreviated $\delta_x$. Hence, if
$A_w = \{a_{w1}, \ldots, a_{wN}\}$ is a corpus of one-dimensional, length-1 observation vectors
corresponding to $N$ true independent pronounced instances of word $w$, then the likelihood
of these observations according to the valuation model is

    $L(A_w \mid \delta_x) = \delta_x(a_{w1}) \cdots \delta_x(a_{wN}).$    (14)

The following discussion relates how to replace true observations with synthetic ones.
Assume for a moment that word $w$ has a single pronunciation $l(w)$, and consider a
synthesized observation corpus $\hat{A}_w = \{\hat{a}_{w1}, \ldots, \hat{a}_{wN}\}$, where the elements are
independent and identically distributed (iid) random variables, distributed according to
density $\delta_{l(w) h}(\cdot)$, hereafter abbreviated $\delta_w$. Fix some finite interval $[-r, r]$, and
assume that it is divided into $N$ subintervals $J_i = [v_i, v_i + \Delta v]$, where
$v_i = -r + i\,\Delta v$ and $\Delta v = 2r/N$, where $i$ runs from 0 to $N - 1$. The expected number
of elements of $\hat{A}_w$ falling into $J_i$ is therefore $\delta_w(v_i) \cdot \Delta v \cdot N$. The synthetic
likelihood of this sequence is defined as

    $L_{rN}(\hat{A}_w \mid \delta_x) = \prod_{i=0}^{N-1} \delta_x(v_i)^{\delta_w(v_i) \cdot \Delta v \cdot N}.$    (15)

Hence, the per-event synthetic log likelihood is

    $S_{rN}(\hat{A}_w \mid \delta_x) = \frac{1}{N} \log L_{rN}(\hat{A}_w \mid \delta_x) = \sum_{i=0}^{N-1} \delta_w(v_i) \cdot \log \delta_x(v_i)\, \Delta v.$    (16)

This is a Riemann-Stieltjes sum. At this point, assume that $\delta_w$ and $\delta_x$ are both mixtures
of Gaussians. (Any other form of density may also be used, provided the integrals (17) and
(18) below exist. The assumption that the densities are mixtures of Gaussians is made only
to ensure this.) Then the integrand $\delta_w(v) \cdot \log \delta_x(v)$ is continuous on the compact set
$[-r, r]$, since $\delta_x(v)$ is bounded uniformly away from 0 on this interval, and so the limit

    $S_r(\hat{A}_w \mid \delta_x) = \lim_{N \to \infty} S_{rN}(\hat{A}_w \mid \delta_x) = \int_{-r}^{r} \delta_w(v) \log \delta_x(v)\, dv$    (17)

exists (as both a Riemann and Lebesgue integral). Taking the limit as $r \to \infty$, define

    $\rho(\delta_w \mid \delta_x) = \int \delta_w \log \delta_x$    (18)

as the synthetic log likelihood of $\delta_w$ given $\delta_x$. Exponentiating this quantity, it is
possible to define

    $\kappa'(\delta_w \mid \delta_x) = \exp \rho(\delta_w \mid \delta_x)$    (19)

as the synthetic likelihood of $\delta_w$ given $\delta_x$. This quantity will be treated as if it were
a true likelihood. This substitution of synthetic for true likelihoods essentially uses
synthesized data to compensate for a corpus that contains inadequate examples of actual
pronunciations.
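
As an illustration only, the following Python sketch approximates the synthetic log
likelihood of Equation 18 for two single one-dimensional Gaussians by the Riemann sum of
Equation 16, and compares it with the standard closed form for the expected Gaussian log
density. The function names, the interval half-width `r`, the number of subintervals `n`,
and the example means and standard deviations are assumptions made for this sketch and do
not come from the disclosure.

```python
import math


def gaussian_pdf(v, mean, std):
    """Density of a one-dimensional Gaussian evaluated at v."""
    z = (v - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def rho_riemann(mean_w, std_w, mean_x, std_x, r=12.0, n=20000):
    """Riemann sum of Equation 16: sum_i delta_w(v_i) * log delta_x(v_i) * dv."""
    dv = 2.0 * r / n
    total = 0.0
    for i in range(n):
        v = -r + i * dv
        # The log density is written out analytically so that tiny density
        # values never reach math.log as an underflowed zero.
        log_dx = (-0.5 * ((v - mean_x) / std_x) ** 2
                  - math.log(std_x * math.sqrt(2.0 * math.pi)))
        total += gaussian_pdf(v, mean_w, std_w) * log_dx * dv
    return total


def rho_closed_form(mean_w, std_w, mean_x, std_x):
    """Equation 18 for two single Gaussians (standard Gaussian identity)."""
    return (-0.5 * math.log(2.0 * math.pi * std_x ** 2)
            - (std_w ** 2 + (mean_w - mean_x) ** 2) / (2.0 * std_x ** 2))


def kappa_prime(mean_w, std_w, mean_x, std_x):
    """Equation 19: exponentiate the synthetic log likelihood."""
    return math.exp(rho_closed_form(mean_w, std_w, mean_x, std_x))
```

For mixtures of Gaussians no such closed form is generally available, which is one reason a
numerical evaluation of the integral may be preferred in practice.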

(3)(d)(1)(A) Compression by Normalization

The synthetic likelihoods have a very large dynamic range. The synthetic likelihoods can
vary anywhere between about $10^{\ldots}$ and $10^{\ldots}$. Because of this, it is beneficial to
compress the range.

One technique for compressing the synthetic likelihoods is to normalize them. Defining
$\Delta$ as the set $\{\delta(\cdot)\}$ of densities that appear in a speech recognition system, then

    $\kappa(\delta_w \mid \delta_x) = \frac{1}{Z(\delta_x)}\, \kappa'(\delta_w \mid \delta_x)$    (20)

where

    $Z(\delta_x) = \sum_{\delta_w \in \Delta} \kappa'(\delta_w \mid \delta_x).$    (21)

Thus, Equations 20 and 21 compress the synthetic likelihoods through normalization.
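
A minimal sketch of the normalization of Equations 20 and 21 for one fixed $\delta_x$ is given
below. Working from logarithms of the raw synthetic likelihoods is a choice made here to
cope with the large dynamic range; it is an assumption of this sketch, as are the function
name `normalize_column`, the labels `w1` to `w3`, and the numeric values.

```python
import math


def normalize_column(log_kappa_prime):
    """Apply Equations 20 and 21 for one fixed delta_x.

    log_kappa_prime maps each density label w in Delta to
    log kappa'(delta_w | delta_x); the result maps w to kappa(delta_w | delta_x).
    """
    # Subtracting the largest log value before exponentiating leaves the
    # ratio kappa' / Z unchanged but keeps the largest term at exactly 1.0.
    peak = max(log_kappa_prime.values())
    scaled = {w: math.exp(lk - peak) for w, lk in log_kappa_prime.items()}
    z = sum(scaled.values())
    return {w: s / z for w, s in scaled.items()}


# Hypothetical raw log likelihoods spanning hundreds of orders of magnitude.
log_values = {"w1": -3.0, "w2": -250.0, "w3": -800.0}
kappa = normalize_column(log_values)
```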

(3)(d)(1)(B) Compression by Ranking

Another technique to compress the synthetic likelihoods is through ranking. In compression
by ranking, each $\kappa'(\delta_w \mid \delta_x)$ in the synthetic likelihood set
$\{\kappa'(\delta_{w_1} \mid \delta_x), \kappa'(\delta_{w_2} \mid \delta_x), \ldots, \kappa'(\delta_{w_{|\Delta|}} \mid \delta_x)\}$ is ranked by value from
largest to smallest. The resultant set is
$\{\kappa'(\delta_{s_1} \mid \delta_x), \kappa'(\delta_{s_2} \mid \delta_x), \ldots, \kappa'(\delta_{s_{|\Delta|}} \mid \delta_x)\}$, where $s$ stands for
"swapped," and $\kappa'(\delta_{s_1} \mid \delta_x)$ is the largest $\kappa'(\delta_w \mid \delta_x)$ in the synthetic
likelihood set, while $\kappa'(\delta_{s_{|\Delta|}} \mid \delta_x)$ is the smallest $\kappa'(\delta_w \mid \delta_x)$ in the
synthetic likelihood set. The resultant set is paired with a known predetermined set of
probabilities, $\{p_1, p_2, \ldots, p_{|\Delta|}\}$. To determine the final synthetic likelihood, $\kappa$,
one replaces the value $\kappa'(\delta_{s_j} \mid \delta_x)$ with the $j$th element of the predetermined set
of probabilities. The value so obtained is used as $\kappa$ for the particular
$\kappa'(\delta_w \mid \delta_x)$ being examined.
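
The ranking procedure just described can be sketched as follows. The function name
`compress_by_ranks`, the labels, the raw values, and the predetermined probability table are
all hypothetical and chosen only to make the example concrete.

```python
def compress_by_ranks(kappa_prime, rank_probabilities):
    """Replace the j-th largest kappa' by the j-th predetermined probability.

    kappa_prime maps a density label w to kappa'(delta_w | delta_x) for one
    fixed delta_x; rank_probabilities is the predetermined list p_1, p_2, ...
    """
    if len(kappa_prime) != len(rank_probabilities):
        raise ValueError("need one predetermined probability per density")
    # Sort the labels by their raw synthetic likelihood, largest first.
    ranked = sorted(kappa_prime, key=kappa_prime.get, reverse=True)
    return {w: rank_probabilities[j] for j, w in enumerate(ranked)}


raw = {"w1": 1e-40, "w2": 3e-5, "w3": 1e-120}   # hypothetical kappa' values
table = [0.7, 0.2, 0.1]                          # hypothetical p_1, p_2, p_3
kappa = compress_by_ranks(raw, table)
```

Only the ordering of the raw values matters in this scheme, so their extreme dynamic range
no longer affects the compressed result.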

(3)(d)(2) Dominance Measure

Another technique for determining synthetic likelihoods is the dominance measure. This
technique uses

    [Equation 22: not recoverable from the source text]    (22)

where

    [Equation 23: not recoverable from the source text]    (23)

The integral in Equation 22 is typically evaluated by Monte Carlo methods, as described in
Numerical Recipes in C, The Art of Scientific Computing, by Press, Teukolsky, Vetterling and
Flannery, Cambridge University Press, Second Edition, Chapter 7, which is hereby
incorporated by reference. It should be noted that all integrals discussed herein may be
calculated through Monte Carlo methods. Additionally, compression by normalization and
compression by ranking may both be used with this measure to compress the synthetic
likelihoods.

(3)(d)(3) Decoder Measure

Another method for determining synthetic likelihood is the decoder measure. In this
technique,

    [Equation 24: not recoverable from the source text]    (24)

where

    [left-hand side not recoverable from the source text] $= \begin{cases} 1 & \text{if } \delta_x(v) = \max \delta_w(v) \\ 0 & \text{otherwise.} \end{cases}$    (25)

Synthetic likelihoods determined by the decoder measure may be compressed through
normalization or ranking.
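
The statement that the integrals discussed herein may be calculated through Monte Carlo
methods can be illustrated on the cross-entropy integral of Equation 18, which is the
expectation of $\log \delta_x(v)$ when $v$ is drawn from $\delta_w$. The sketch below assumes single
one-dimensional Gaussians; the function names, sample count, seed, and example parameters
are assumptions of this sketch rather than part of the disclosure.

```python
import math
import random


def log_gaussian(v, mean, std):
    """Log density of a one-dimensional Gaussian evaluated at v."""
    return (-0.5 * ((v - mean) / std) ** 2
            - math.log(std * math.sqrt(2.0 * math.pi)))


def rho_monte_carlo(mean_w, std_w, mean_x, std_x, samples=100000, seed=0):
    """Estimate Equation 18 by averaging log delta_x over draws from delta_w."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    total = 0.0
    for _ in range(samples):
        v = rng.gauss(mean_w, std_w)          # synthesize according to delta_w
        total += log_gaussian(v, mean_x, std_x)  # evaluate according to delta_x
    return total / samples
```

The estimate carries sampling noise that shrinks roughly as the inverse square root of the
number of samples, so the sample count trades accuracy against computation time.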

(3)(d)(4) Empirical Measure

This method uses training data, which is a series of cepstral vectors
$\{v_1, v_2, \ldots, v_N\}$. This training set of vectors is determined from speech taken of a
person speaking a known text. There is a corresponding true set of probability density
functions, $\{\delta_{x_1}, \delta_{x_2}, \ldots, \delta_{x_N}\}$, which are known because the underlying text and
acoustic realization of that text are known. There is also a decoded set of probability
density functions, $\{\delta_{w_1}, \delta_{w_2}, \ldots, \delta_{w_N}\}$, which are determined by performing speech
recognition on the cepstral vectors. Each $\delta$ will be referred to as a "leaf" herein. Thus,
there is a true set of leaves and a decoded set of leaves, and these are paired one with
the other, on a vector-by-vector basis.

The synthetic likelihood is then determined as

    $\kappa'(\delta_w \mid \delta_x) = \frac{C(w, x)}{C(x)},$    (26)

where $C(x)$ is the number of times $\delta_x$ occurred among true leaves and $C(w, x)$ is the
number of times the true leaf $x$ was paired with the decoder's guess leaf $w$. Additionally,
the synthetic likelihood of Equation 26 may be determined when the decoder does a full
decoding at the word level or when the decoder does a full decoding at the leaf level.

Because Equation 26 guarantees normalization of the synthetic likelihoods, compression
through normalization is not necessary. However, compression by ranking may be performed.
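
The counting in Equation 26 can be sketched as below. The leaf labels and the two short
leaf sequences are invented for illustration, and `empirical_kappa` is a hypothetical helper
name; a real system would obtain the sequences from aligned training data and a decoder.

```python
from collections import Counter


def empirical_kappa(true_leaves, decoded_leaves):
    """Equation 26: kappa'(w | x) = C(w, x) / C(x) from paired leaf sequences."""
    if len(true_leaves) != len(decoded_leaves):
        raise ValueError("leaf sequences must be paired vector by vector")
    count_x = Counter(true_leaves)                       # C(x)
    count_wx = Counter(zip(decoded_leaves, true_leaves))  # C(w, x)
    return {(w, x): c / count_x[x] for (w, x), c in count_wx.items()}


# Hypothetical leaf labels, one per cepstral vector.
true_leaves = ["x1", "x1", "x2", "x1", "x2"]
decoded_leaves = ["x1", "x2", "x2", "x1", "x2"]
kappa = empirical_kappa(true_leaves, decoded_leaves)
```

For each true leaf the resulting values sum to one over the decoded leaves, which is the
normalization property noted above.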



(3)(e) Paths and Densities

In this section, it is shown how paths, densities and synthetic likelihoods are related. In
general, the valuation (synthesis) of a sequence depends upon a particular path in the
evaluation (synthesis) model. This dependency expresses itself in two ways: through the
probability of the path, and through the sequence of densities associated with the path.

To begin, hidden Markov model formalism and notation are reviewed; this notation is
established as follows:

    $Q = \{q_m\}$                  a set of states
    $q_I, q_F$                     respectively, initial and final states, drawn from $Q$
    $\tau = \{\tau_{mn}\}$         the probability of transition $q_m \to q_n$
    $\delta = \{\delta_{mn}\}$     a collection of densities, where $\delta_{mn}$ is the density
                                   associated with transition $q_m \to q_n$

The collection $\langle Q, q_I, q_F, \tau, \delta \rangle$ will be referred to as a hidden Markov model $H$.
To distinguish between different models, corresponding to, for instance, lexemes $l(x)$ and
$l(w)$, a subscript will be attached to the hidden Markov model. Thus, a hidden Markov model
corresponding to the lexeme $l(x)$ will be referred to as $H_x$, its state set as $Q_x$, a
transition probability as $\tau_{x\,mn}$, and so on.

For a given length-$T$ observation sequence $\langle a_w^t \rangle$, the likelihood of this sequence
according to the model $H$ is

    $L(\langle a_w^t \rangle \mid H) = \sum_{\pi} p(\langle a_w^t \rangle \mid \pi)\, p(\pi).$    (27)

    [Equation 28: not recoverable from the source text]    (28)

Here, $\pi$ is a sequence of states $\langle \pi_0, \pi_1, \ldots, \pi_T \rangle$, also called a path; $p(\pi)$ is
the probability of the path $\pi$ according to $\tau$; and $p(\langle a_w^t \rangle \mid \pi)$ is the likelihood of
the observation sequence according to the densities associated with $\pi$. Since an
observation is associated with a transition and not a state, the likelihood of a sequence
of $T$ observations is evaluated with respect to a path comprising $T$ transitions and,
therefore, $T + 1$ states. A path is considered valid if $\pi_0 = q_I$ and $\pi_T = q_F$. The sums
in Equations 27 and 28 run over all valid paths in $Q^{T+1}$, the $(T + 1)$-fold Cartesian
product of $Q$ with itself.
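
To make the sum over valid paths in Equation 27 concrete, the sketch below enumerates every
element of the $(T + 1)$-fold Cartesian product of the state set and keeps only valid paths.
It assumes one Gaussian per transition; the data layout (`trans` and `dens` dictionaries
keyed by state pairs) and all names are assumptions of this sketch. Brute-force enumeration
is exponential in $T$ and is shown only to mirror the definition.

```python
import itertools
import math


def gaussian_pdf(v, mean, std):
    """Density of a one-dimensional Gaussian evaluated at v."""
    z = (v - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def path_sum_likelihood(states, initial, final, trans, dens, observations):
    """Equation 27: sum over valid paths of p(observations | path) * p(path).

    trans[(m, n)] is the probability of transition m -> n, and dens[(m, n)]
    is a (mean, std) pair for the Gaussian attached to that transition.
    """
    T = len(observations)
    total = 0.0
    for path in itertools.product(states, repeat=T + 1):
        if path[0] != initial or path[-1] != final:
            continue  # not a valid path
        p_path, p_obs = 1.0, 1.0
        for t in range(T):
            arc = (path[t], path[t + 1])
            if arc not in trans:
                p_path = 0.0  # the path uses a transition the model lacks
                break
            p_path *= trans[arc]
            p_obs *= gaussian_pdf(observations[t], *dens[arc])
        total += p_path * p_obs
    return total
```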

Consider a restricted model $H_x$ in which every transition is forced. Let
$Q_x = \{q_{x0} = q_{xI}, q_{x1}, \ldots, q_{xT} = q_{xF}\}$, with $T = |Q_x| - 1$, and suppose the only
transitions with non-zero probability are $q_{x\,m} \to q_{x\,m+1}$. Then there is only one valid
path through this model, which is

    $\pi_x = q_{x0}\, q_{x1} \cdots q_{xT}$

and its probability is unity. Thus, the sum in Equation 28 collapses, and the following
results

    $L(\langle a_w^t \rangle \mid H_x) = \delta_{x\,01}(a_w^1)\, \delta_{x\,12}(a_w^2) \cdots \delta_{x\,T-1\,T}(a_w^T)$    (29)

for the length-$T$ observation sequence $\langle a_w^t \rangle$.

Suppose now that an observation sequence $\langle \hat{a}_w^t \rangle$ is being synthesized according to an
identically restricted model $H_w$, so that there is only one valid path
$\pi_w = q_{w0} \cdots q_{wT}$ through this model. An event synthesized by this model is a sequence
of synthetic observations of length $T$, with sample $\hat{a}_w^1 \sim \delta_{w\,01}$, sample
$\hat{a}_w^2 \sim \delta_{w\,12}$, and so on. Conversely, evaluation model $H_x$ concentrates all its
probability mass on observation sequences of length $T$, evaluating the likelihood of the
first observation according to $\delta_{x\,01}$, and so on. Thus, it is written

    $L(H_w \mid H_x\, \pi_w \pi_x) = \kappa(\delta_{w\,01} \mid \delta_{x\,01})\, \kappa(\delta_{w\,12} \mid \delta_{x\,12}) \cdots \kappa(\delta_{w\,T-1\,T} \mid \delta_{x\,T-1\,T})$    (30)

for the synthetic likelihood of $H_w$ according to $H_x$.

The paths $\pi_x$ and $\pi_w$ determine which $\kappa$ values appear in the product. A given
$\kappa(\delta_{w\,mn} \mid \delta_{x\,rs})$ is present precisely when the corresponding transitions
$q_{wm} \to q_{wn}$ and $q_{xr} \to q_{xs}$ are traversed in the same discrete time interval in $H_w$
and $H_x$, respectively. The path probabilities themselves do not appear explicitly above
because they are both unity, due to the restrictive transition structure of each model. In
the sections that follow, this disclosure shows how this restrictive assumption may be
removed.

(3)(f) Confusability of Hidden Markov Models

This section details how confusability is determined from evaluation and synthesizer models
that are hidden Markov models. This section proceeds as follows. First, this section
presents a visual argument of how a product machine is designed from two hidden Markov
models and how an exemplary probability flow matrix, $M$, is determined from the product
machine. Next, this section describes a more mathematically formal argument for determining
the product machine, the probability flow matrix that is determined from the product
machine, and acoustic confusability.

(3)(f)(1) Product Machine Determination: a Visual Example

In what follows, $H_x$ is the valuation model, and $H_w$ is the synthesizer model. For
simplicity, it is assumed that both $H_x$ and $H_w$ are three-state models of single phones,
with the topology, densities and transition probabilities as depicted in FIG. 3. However,
the construction is entirely general, and there is no need to restrict the size or topology
of either model. FIG. 3 shows synthesizer model 310, $H_w$, and evaluation model 320, $H_x$.
Here $H_x$ has states $Q_x = \{x_1, x_2, x_3\}$, transition probabilities $x_1$, $\bar{x}_1$, $x_2$,
$\bar{x}_2$ and $x_3$ associated with the arcs as shown, and densities $\delta_{x1}$, $\delta_{x2}$, $\delta_{x3}$
associated as shown. The model $H_w$ has states $Q_w = \{w_1, w_2, w_3\}$, transition
probabilities $w_1$, $\bar{w}_1$, $w_2$, $\bar{w}_2$ and $w_3$ associated with the arcs as shown, and
densities $\delta_{w1}$, $\delta_{w2}$, $\delta_{w3}$ associated as shown. In the following discussion, both a
state and a transition probability will have the same notation. For example, a state <x1>
in the evaluation model has transition probabilities $x_1$ and $\bar{x}_1$.

From the two hidden Markov models shown in FIG. 3, a product machine may be created. Turning
now to FIG. 4, this figure shows synthesizer model 310, evaluation model 320 and a product
machine 430. The product machine is created from the synthesizer 310 and evaluation 320
models.

It should be noted that these probability density functions for both models may be
determined in part by the history, $h$. This can allow the product machine 430 to account
for context. It should also be noted that the hidden Markov models for the evaluation model
320 and synthesizer model 310 may include additional transitions not shown. For instance,
there could be a transition from state <x2> to state <x1> and from state <x3> to state <x2>
for the evaluation model. The product machine 430 and the probability flow matrix, to be
described below, can take these transitions into account.

Product machine 430 comprises a number of product machine states <w1x1> through <w3x3>.
These are called product machine states simply to differentiate them from the hidden Markov
model states associated with synthesizer model 310 and evaluation model 320. Each product
machine state is named with a pair of states: one of the states of the synthesizer model 310
and one of the states of the evaluation model 320. Thus, product machine state <w1x1>
corresponds to state <w1> of the synthesizer model 310 and state <x1> of the evaluation
model 320, while product machine state <w3x2> corresponds to state <w3> of the synthesizer
model 310 and to state <x2> of the evaluation model 320. Therefore, each product machine
state corresponds to a pair of states of the synthesizer 310 and evaluation 320 models,
where "pair" is to be interpreted as one state of the synthesizer model 310 and one state of
the evaluation model 320.

For a synthesizer model 310 having m states and an evaluation model 320 having n states,
there will be n x m product machine states in the product machine 430. Thus, for a
synthesizer model 310 having 3 states and an evaluation model 320 having 3 states, there
will be 9 product machine states in the product machine 430.

Each product machine transition has a probability associated with it. These transition
probabilities are determined as the product of the corresponding transition probabilities
in the synthesizer model 310 and evaluation model 320. Each product machine transition
corresponds to a possible transition by a state of the synthesizer model 310 and a possible
simultaneous transition within evaluation model 320. For instance, product machine state
<w1x1> corresponds to state <w1> of the synthesizer model 310 and state <x1> of the
evaluation model 320. The possible transitions out of state <w1> of the synthesizer model
310 are the transition from state <w1> to state <w2> (having probability $\bar{w}_1$) and the
transition from state <w1> to itself (having probability $w_1$). Therefore, the two allowed
transitions are either looping back to the same state or leaving this state. Similarly, the
two allowed transitions for state <x1> of the evaluation model 320 are looping back to the
same state (having probability $x_1$) or leaving this state (having probability $\bar{x}_1$).

Therefore, the transitions from product machine state <w1x1>, which is essentially a
combination of state <w1> of the synthesizer model 310 and state <x1> of the evaluation
model 320, are as follows. One product machine transition is a loop back transition to the
same state, which has probability $w_1 x_1$. When there is a transition out of state <w1> of
the synthesizer model 310 but there is a loop back for state <x1> of the evaluation model
320, then the product machine state <w1x1> will transition to product machine state <w2x1>,
according to the probability $\bar{w}_1 x_1$. Similarly, when there is a loop back for state
<w1> of the synthesizer model 310 but a transition out of state <x1> to state <x2> of the
evaluation model 320, then the product machine state <w1x1> will transition to product
machine state <w1x2>, according to the probability $w_1 \bar{x}_1$. Finally, when state <w1> of
the synthesizer model 310 transitions to state <w2> and state <x1> of the evaluation model
320 transitions to state <x2>, then the product machine state <w1x1> will transition to
product machine state <w2x2>, according to the probability $\bar{w}_1 \bar{x}_1$.

The other product machine transition probabilities and product machine transitions may be
similarly determined.
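
A short sketch of how the remaining product machine transitions might be enumerated is
given below. It assumes the three-state left-to-right topology described above, in which
each state either loops back or advances to the next state; the numeric loop probabilities,
the convention that the advance probability equals one minus the loop probability, and the
helper names are assumptions of this sketch.

```python
import itertools

# Hypothetical loop-back probabilities for the states of the two models.
W_LOOP = {"w1": 0.6, "w2": 0.5, "w3": 0.7}
X_LOOP = {"x1": 0.4, "x2": 0.8, "x3": 0.3}


def chain_transitions(loop):
    """Left-to-right transitions: every state loops, all but the last advance."""
    names = list(loop)
    trans = {}
    for k, s in enumerate(names):
        trans[(s, s)] = loop[s]
        if k + 1 < len(names):
            trans[(s, names[k + 1])] = 1.0 - loop[s]
    return trans


def product_machine(w_trans, x_trans):
    """Pair each synthesizer transition with each simultaneous evaluation transition."""
    machine = {}
    for (wm, wn), pw in w_trans.items():
        for (xr, xs), px in x_trans.items():
            # Probability of the paired transition is the product of the two.
            machine[((wm, xr), (wn, xs))] = pw * px
    return machine


w_trans = chain_transitions(W_LOOP)
x_trans = chain_transitions(X_LOOP)
states = list(itertools.product(W_LOOP, X_LOOP))   # the product machine states
machine = product_machine(w_trans, x_trans)        # the product machine transitions
```

With five transitions in each three-state model, the sketch yields nine product machine
states and twenty-five product machine transitions.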

Note that each product machine state has a synthetic likelihood associated with it. These
synthetic likelihoods are shown in FIG. 5. For instance, product machine state <w1x1> would
have the synthetic likelihood $\kappa(\delta_{w1} \mid \delta_{x1})$ associated with it, product machine
state <w1x2> would have the synthetic likelihood $\kappa(\delta_{w1} \mid \delta_{x2})$ associated with it, and
product machine state <w2x3> would have the synthetic likelihood $\kappa(\delta_{w2} \mid \delta_{x3})$
associated with it. Note that these generally will be synthetic likelihoods that have been
compressed through the compression techniques previously described.

Referring now to FIG. 6, with appropriate reference to FIGS. 4 and 5, FIG. 6 shows a
probability flow matrix 600 that is populated using the product machine 430 of FIGS. 4 and
5. Matrix 600 comprises "from" row 630 and "to" column 640. An element in the matrix will
contain a probability value only if a transition exists from one of the product machine
states in row 630 to one of the product machine states in column 640. If there is no
transition, the matrix element will be zero, meaning that there is no probability associated
with the element. Each element of probability flow matrix 600 therefore corresponds to a
potential transition between product machine states. Which elements of matrix 600 will
contain probabilities depends on which transitions are valid in the hidden Markov models for
the evaluation and synthesizer models. Similarly, which potential transitions between
product machine states will correspond to actual product machine transitions also depends
on which transitions are valid in the hidden Markov models for the evaluation and
synthesizer models.

The following examples illustrate these principles. There is a product machine transition
from product machine state <w1x1> (of row 630) to product machine state <w1x1> (of column
640). In product machine 430, this is the loop back transition that has transition
probability $w_1 x_1$. Thus, element 601 has a probability associated with it. This
probability is expressed as $w_1 x_1 \kappa(\delta_{w1} \mid \delta_{x1})$, where the synthetic likelihood is
generally a synthetic likelihood that has been compressed (as previously discussed).
However, there are no additional product machine transitions into product machine state
<w1x1>; therefore, the rest of the row in which element 601 occurs contains elements that
are zero, which means that no probabilities are associated with these elements. In fact, the
diagonal elements should each contain probabilities, as there are loop back probabilities
for each of the exemplary product machine states of FIG. 6. There is also a product machine
transition from product machine state <w1x1> (of row 630) to product machine state <w2x1>
(of column 640). Thus, element 606 contains a probability. This probability is expressed as
$\bar{w}_1 x_1 \kappa(\delta_{w1} \mid \delta_{x1})$. Similarly, there is a product machine transition from
product machine state <w2x3> (of row 630) to product machine state <w3x3> (of column 640).
Thus, element 623 contains a probability. This probability is expressed as
$\bar{w}_2 x_3 \kappa(\delta_{w2} \mid \delta_{x3})$. However, there is no product machine transition from
product machine state <w3x3> to product machine state <w2x3>, and this element is zero.

If this process is continued through all of the elements of matrix 600, there will be 25
elements that contain a probability and 56 elements that are zero. The 25 elements that
contain a probability are numbered 601 through 625. Consequently, matrix 600 is a sparsely
populated matrix.
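
The population of the probability flow matrix can be sketched as follows, under the same
assumed three-state left-to-right topology. The ordering of product machine states used for
indexing, the numeric loop probabilities, and the synthetic likelihood values in `KAPPA` are
assumptions of this sketch; only the counts of populated and zero elements are taken from
the text.

```python
W = ["w1", "w2", "w3"]
X = ["x1", "x2", "x3"]
W_LOOP = {"w1": 0.6, "w2": 0.5, "w3": 0.7}   # hypothetical loop probabilities
X_LOOP = {"x1": 0.4, "x2": 0.8, "x3": 0.3}
# Hypothetical compressed synthetic likelihoods kappa(delta_w | delta_x).
KAPPA = {(w, x): (0.9 if w[1:] == x[1:] else 0.2) for w in W for x in X}


def chain(names, loop):
    """Left-to-right transitions: every state loops, all but the last advance."""
    trans = {}
    for k, s in enumerate(names):
        trans[(s, s)] = loop[s]
        if k + 1 < len(names):
            trans[(s, names[k + 1])] = 1.0 - loop[s]
    return trans


def flow_matrix():
    """Build M with M[i][j] = tau(j -> i) * kappa for the source state j."""
    # Assumed state order: w varies fastest, so zero-based index 4 is <w2x2>.
    states = [(w, x) for x in X for w in W]
    index = {s: k for k, s in enumerate(states)}
    size = len(states)
    M = [[0.0] * size for _ in range(size)]
    for (wm, wn), pw in chain(W, W_LOOP).items():
        for (xr, xs), px in chain(X, X_LOOP).items():
            i, j = index[(wn, xs)], index[(wm, xr)]   # transition j -> i
            M[i][j] = pw * px * KAPPA[(wm, xr)]
    return states, M


states, M = flow_matrix()
```

Because most entries are zero, a practical implementation would likely store only the
populated elements rather than the full array used here for clarity.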

It is also possible to mathematically determine whether an element of matrix 600 should be
populated. A matrix element $M_{ij}$, where each $i$ or $j$ refers to a pair of states from the
evaluation and synthesizer models, is zero unless product machine state $j$ transitions to
product machine state $i$, in which case

    $M_{ij} = \tau(j \to i) \cdot \kappa(j \to i).$    (31)

The term $\tau(j \to i)$ indicates the synthetic probability of transitioning between these
states and the term $\kappa(j \to i)$ indicates the synthetic likelihood for the product machine
transition. For example, $M_{65}$ is element 614 and contains a probability because product
machine state <w2x2> transitions to product machine state <w3x2> (see FIG. 4). The
associated probability will be $\bar{w}_2 x_2 \kappa(\delta_{w2} \mid \delta_{x2})$. On the other hand, there is a
zero for $M_{78}$ because there are no transitions between product machine state <w2x3> and
product machine state <w1x3>.

It should be noted that probability flow matrix 600 (also referred to as the matrix "M"
below) could contain more non-zero elements if the hidden Markov models for the evaluation
model 320 and synthesizer model 310 are changed. For instance, if there is an allowed
transition from state <w2> of the synthesizer model 310 to state <w1> of the synthesizer
model 310, matrix element $M_{78}$ would contain a probability.

Because of the sparse nature of probability flow matrix 600, there are some speed
enhancements that may be used to dramatically decrease computation time and resources when
calculating confusability using a probability flow matrix. These enhancements are discussed
in more detail below.



(3)(f)(2) Product Machine Determination and Confusability

Now that a visual representation of a product machine and its resultant probability flow
matrix have been described, a more formal argument concerning these is given. This section
develops a construction that defines a confusability measure between arbitrary hidden Markov
models. This measure comprises observation sequences synthesized over all valid paths of all
lengths, and yields an efficient algorithm that gives an exact result.

This approach in many ways resembles a forward pass algorithm, used to determine the
assignment by a hidden Markov model of likelihood to a given sequence of observations, which
is now briefly reviewed. The forward pass algorithm operates on a trellis, which is a
rectangular array of nodes with as many rows as the number of states in the model, and as
many columns as the number of observations plus one. Each column of nodes constitutes one
time slice of the trellis. Starting with likelihood 1 assigned to the initial state at the
first time slice (and hence mass 0 assigned to all other states in this time slice), the
algorithm assigns a likelihood to each state at each time, according to the equation

    $\gamma_s^{t+1} = \sum_{s' \to s} \gamma_{s'}^{t}\, \tau_{s's}\, \delta_{s's}(o_t),$    (32)

which is called a forward trellis equation. Here $\gamma_s^t$ is the likelihood of state $s$ at
time $t$, $\tau_{s's}$ is a transition probability, and $\delta_{s's}(o_t)$ is the likelihood of the
observation $o_t$ recorded at time $t$. The notation $s' \to s$ on the summation means that the
sum is taken over all states $s'$ with arcs incident on $s$. For a sequence of $T$
observations, the value $\gamma_F^T$ computed for the final state $F$ at the last time slice $T$
is then the likelihood that the model assigns to the complete sequence.

As in the forward pass algorithm, this analysis proceeds by unrolling a state machine into a
trellis, writing suitable forward trellis equations, and computing probabilities for trellis
nodes. The key difference is that this analysis does not operate with respect to a given
sequence of true observations. Here the observations are being synthesized. This means that
there is no natural stopping point for the trellis computation. It is unclear as to what
time $T$ should be declared as the end of the synthesized observation sequence. Additionally,
it is unclear as to what time $T$ the mass, assigned to the final state in a time slice,
should be taken as the synthetic sequence likelihood. To resolve this problem, the analysis
operates on an infinite trellis and sums the probability assigned to the final state over
all time slices.
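
For reference, the ordinary forward pass over true observations reviewed above can be
sketched as follows, with one Gaussian density per transition. The dictionary-based data
layout and all names are assumptions of this sketch; the update inside the loop is the
forward trellis equation, Equation 32.

```python
import math


def gaussian_pdf(v, mean, std):
    """Density of a one-dimensional Gaussian evaluated at v."""
    z = (v - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def forward_likelihood(states, initial, final, trans, dens, observations):
    """Forward pass on a trellis with len(observations) + 1 time slices.

    trans[(s_prev, s)] is a transition probability and dens[(s_prev, s)] is a
    (mean, std) pair for the Gaussian attached to that transition.
    """
    # First time slice: likelihood 1 on the initial state, 0 elsewhere.
    gamma = {s: 0.0 for s in states}
    gamma[initial] = 1.0
    for o in observations:
        nxt = {s: 0.0 for s in states}
        # Sum over all arcs s_prev -> s incident on each state s.
        for (s_prev, s), tau in trans.items():
            nxt[s] += gamma[s_prev] * tau * gaussian_pdf(o, *dens[(s_prev, s)])
        gamma = nxt
    # Value for the final state at the last time slice.
    return gamma[final]
```

The cost is proportional to the number of observations times the number of arcs, in
contrast to explicit enumeration of every path.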

From the two hidden Markov models shown in FIG. 3, the product machine $H_{w|x}$ is defined
as follows. First, the following notation and definitions are set out:

    $Q_{w|x} = Q_w \times Q_x$                                       a set of states
    $q_{w|x\,I} = (q_{wI}, q_{xI})$                                  an initial state
    $q_{w|x\,F} = (q_{wF}, q_{xF})$                                  a final state
    $\tau_{w|x\,\langle w_m, x_r \rangle \langle w_n, x_s \rangle} = \tau_{w\,mn}\, \tau_{x\,rs}$    a set of transition probabilities.

The states and transitions of this product machine have been shown in FIG. 4. The synthetic
likelihoods are shown in FIG. 5. Although superficially $H_{w|x}$ shares many of the
characteristics of a hidden Markov model, it is not in fact a model of anything. In
particular the arcs are not labeled with densities, from which observation likelihoods may
be computed. Instead, an arc $\langle w_m, x_r \rangle \to \langle w_n, x_s \rangle$ is labeled with the synthetic
likelihood $\kappa(\delta_{w\,mn} \mid \delta_{x\,rs})$, and this quantity is treated as the likelihood,
according to $\delta_{x\,rs}$, of observing a sample generated according to $\delta_{w\,mn}$.

Now observe that any path taken through the state diagram of $H_{w|x}$ is a sequence
$\langle w^0 x^0 \rangle, \langle w^1 x^1 \rangle, \ldots$ of pairs of states of the original machines, $H_w$ and
$H_x$. There is a natural bijection between sequences $\pi_{w|x}$ of state pairs, and pairs of
state sequences $\langle \pi_w, \pi_x \rangle$. Moreover, every pair $\langle \pi_w, \pi_x \rangle$ of valid paths of
identical lengths in $H_w$ and $H_x$ respectively corresponds to a path in $H_{w|x}$, and
conversely. Thus, a computation that traverses all valid paths in $H_{w|x}$ comprises all
pairs of same-length valid paths in the synthesizer and valuation models.

A trellis for the state-transition graph of FIG. 4 is constructed, and appropriate forward
trellis equations are written, with synthetic likelihoods in place of true observation
probabilities. This is shown in FIG. 7. The left panel of FIG. 7 shows two successive time
slices in the trellis. The arcs drawn correspond to the allowed state transitions of
$H_{w|x}$.

Now the forward trellis equation is derived for state <w1x2>, as pictured in the right
panel of FIG. 7. The aim is to obtain an expression for $\gamma^{t+1}_{\langle w_1 x_2 \rangle}$, the
likelihood of arriving at this state at time $t + 1$ by any path, having observed $t$ frames
of synthetic data for $w$, as evaluated by the densities of $x$. It is apparent from the
diagram that there are only two ways that this can happen: via a transition from <w1x1> and
via a transition from <w1x2>.

Suppose that the synthetic likelihood of arriving in state <w1x1> at time $t$ by all paths
is $\gamma^{t}_{\langle w_1 x_1 \rangle}$. The probability of traversing both transition $w_1 \to w_1$ in $H_w$
and transition $x_1 \to x_2$ in $H_x$ is $w_1 \bar{x}_1$, and the synthetic likelihood of the data
corresponding to this transition pair is $\kappa(\delta_{w1} \mid \delta_{x1})$. Thus, the contribution to
$\gamma^{t+1}_{\langle w_1 x_2 \rangle}$ of all paths passing through <w1x1> at time $t$ is

    $\kappa(\delta_{w1} \mid \delta_{x1})\, w_1 \bar{x}_1\, \gamma^{t}_{\langle w_1 x_1 \rangle}.$    (33)

Likewise, the contribution from paths passing through <w1x2> at time $t$ is

    $\kappa(\delta_{w1} \mid \delta_{x2})\, w_1 x_2\, \gamma^{t}_{\langle w_1 x_2 \rangle}.$    (34)

Since these paths pass through different states at time $t$ they are distinct, so their
probabilities add and the following results

    $\gamma^{t+1}_{\langle w_1 x_2 \rangle} = \kappa(\delta_{w1} \mid \delta_{x1})\, w_1 \bar{x}_1\, \gamma^{t}_{\langle w_1 x_1 \rangle} + \kappa(\delta_{w1} \mid \delta_{x2})\, w_1 x_2\, \gamma^{t}_{\langle w_1 x_2 \rangle},$    (35)

the forward trellis equation for state <w1x2>. In a straightforward way, one can write such
an equation for every state of $H_{w|x}$.

The following is a crucial observation. Writing $\gamma^t$ for the distribution of probability
mass across all nine states of $Q_{w|x}$ at time $t$, one can write

    [Equation 36: not recoverable from the source text]    (36)

and likewise for the same vector one time step later. Given these definitions, by the
argument presented above, the complete family of trellis equations is expressed as

    $\gamma^{t+1} = M \gamma^{t}.$    (37)

Here $M$ is a square matrix of dimension $mn \times mn$, where $m = |Q_x|$, $n = |Q_w|$ and
$mn = |Q_{w|x}|$, since $|Q_{w|x}| = |Q_w| \cdot |Q_x|$. Note that the elements of $M$ do not depend
at all upon $t$. The sparse structure of $M$ is a consequence of the allowed transitions of
$H_{w|x}$, and the numerical values of its entries are determined by transition probabilities
and synthetic likelihoods.

By assumption, at time 0 all the probability mass in $\gamma^0$ is concentrated on the initial
state <w1x1>, thus $\gamma^0 = (1\ 0\ \cdots\ 0)^{\top}$. By iteration of Equation 37, the sequence of
distributions is obtained as follows:

    $\gamma^1 = M\gamma^0, \quad \gamma^2 = M\gamma^1 = M^2\gamma^0, \quad \gamma^3 = M\gamma^2 = M^3\gamma^0, \quad \ldots$    (38)

or in general $\gamma^t = M^t \gamma^0$. The question still remains as to what the total probability
is, over all time, of arriving in the final state <w3x3> of $H_{w|x}$. The quantity is
written as $\xi_{w|x}$. By summing the set of equations in Equation 38 over all time we have
the following result:

    $\xi_{w|x} = [\gamma^0]_{\langle w_3 x_3 \rangle} + [\gamma^1]_{\langle w_3 x_3 \rangle} + [\gamma^2]_{\langle w_3 x_3 \rangle} + \cdots$    (39)

    $\phantom{\xi_{w|x}} = [\gamma^0 + M\gamma^0 + M^2\gamma^0 + M^3\gamma^0 + \cdots]_{\langle w_3 x_3 \rangle}$    (40)

    $\phantom{\xi_{w|x}} = [(I + M + M^2 + M^3 + \cdots)\, \gamma^0]_{\langle w_3 x_3 \rangle},$    (41)

where the notation $[\ ]_{\langle w_3 x_3 \rangle}$ denotes the extraction of element <w3x3> of the vector
enclosed in brackets.

It remains to show how the sum $I + M + M^2 + M^3 + \cdots$ can be computed. This may be
written as follows:

    $S_n = I + M + M^2 + M^3 + \cdots + M^n;$    (42)

note that this defines an array of individual real-valued sequences. Since every element of
$M$ is non-negative, so too are the elements of $M^j$ for arbitrary $j$. Hence all the
sequences defined by Equation 42 are non-negative and non-decreasing. The matrix $M$ is
convergent if each of the individual sequences of $S_n$ converges as $n \to \infty$, in the usual
sense of convergence of a sequence of real numbers. This limit is written as
$S = \lim_{n \to \infty} S_n$ if these limits exist. For notational convenience, for the rest of
this section all limits will be taken as $n \to \infty$. The following may be proven: if $M$ is
convergent then $M^n \to 0$; if $S_n \to S$ and $A$ is any $m \times m$ matrix, then $A S_n \to A S$ and
$S_n A \to S A$; if $M$ is convergent then $(I - M)^{-1}$ exists and $S = (I - M)^{-1}$. Proofs of
these are shown in Printz et al., "Theory and Practice of Acoustic Confusability," Automatic
Speech Recognition: Challenges for the New Millenium, Spoken Language Processing Group,
Paris, France, Sept. 18-20, 2000, the disclosure of which is incorporated herein by
reference. Thus, $I - M$ is invertible, and $S = \lim_{n \to \infty} S_n$ is its inverse.

Note that a sufficient condition for $M$ to be convergent is that each eigenvalue $\lambda$ of
$M$ satisfy $|\lambda| < 1$. For then the Jordan canonical form of $M$ is convergent, by a simple
comparison argument with geometric series, and hence so is $M$ itself. This is explained in
more detail in Herstein, "Topics in Algebra," John Wiley and Sons, New York, NY, section
6.6, 1975, the disclosure of which is incorporated herein by reference. If $M$ is convergent
then the following occurs:

    $\xi_{w|x} = [(I - M)^{-1} \gamma^0]_{\langle w_3 x_3 \rangle}.$    (43)

But it is observed that the vector (/-M)~^y^ is just the {w\x\) column of 
the matrix (/-M)"\ and the (^3x3) element of this vector is sought. More generally, if 
is y^, which is to say an m-element vector with a 1 in the position corresponding to the 
initial state of Hy,\x, and with Os everywhere else, and if up is defined likewise, except 
10 with a 1 in the position corresponding to the final state ofHyv\x-> then 

^^lx = li}(I~-Mr%. (44) 

This is the fundamental definition of the confusability of $H_w$ given $H_x$. It is
the estimate of the likelihood, according to model $H_w$, of observing acoustics synthesized
according to $H_x$.
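The definition can be checked numerically by following the probability flow directly. The Python sketch below, offered only as an illustration, starts from the unit vector for the initial state, repeatedly applies M, and accumulates the mass that reaches the final state. The 3-state matrix is hypothetical, and the function name and the fixed iteration count are assumptions of the example.

```python
def confusability_by_iteration(m, initial, final, n_steps=500):
    """Accumulate [y^0 + y^1 + y^2 + ...]_final, where y^(n+1) = M y^n.

    m is the flow matrix as a list of rows; m[i][j] is the flow from
    state j to state i. The series is truncated after n_steps terms.
    """
    size = len(m)
    y = [0.0] * size
    y[initial] = 1.0              # y^0 is the unit vector of the initial state
    total = y[final]
    for _ in range(n_steps):
        y = [sum(m[i][j] * y[j] for j in range(size)) for i in range(size)]
        total += y[final]
    return total


# Hypothetical left-to-right machine with three states (self-loops on the diagonal).
M = [[0.5, 0.0, 0.0],
     [0.3, 0.4, 0.0],
     [0.0, 0.2, 0.1]]
xi = confusability_by_iteration(M, initial=0, final=2)
```

For this matrix, solving $(I - M) r = u_I$ by hand gives $r = (2, 1, 2/9)$, so the accumulated value approaches 2/9, the final-state element of the initial-state column of $(I - M)^{-1}$.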

The models $H_w$ and $H_x$ have been treated abstractly, but it should be clear
that it is intended for each one to represent a lexeme. Thus, $H_w$ is the hidden Markov
model for some lexeme l(w), and likewise $H_x$ for l(x). To exhibit this explicitly the
notation may be changed slightly to write

$$\xi(l(w) \mid l(x)\, h) = u_F^\top \left( I - M(l(w) \mid l(x)\, h) \right)^{-1} u_I. \tag{45}$$

The new quantity $\xi$ is introduced in Equation 45, rather than writing
$p(l(w) \mid l(x)\, h)$ outright on the left hand side of Equation 45, because practical experience
has shown that Equation 45 yields exceedingly small values. Using raw $\xi$ values as
probabilities in computations rapidly exhausts the available precision of computers. For
this reason Equation 45 is renormalized via the following:

$$p(l(w) \mid l(x)\, h) = \frac{\xi(l(w) \mid l(x)\, h)^{\lambda}}{\sum_{l(u) \in B} \xi(l(u) \mid l(x)\, h)^{\lambda}}, \tag{46}$$

where B represents the finite, discrete set of baseforms (lexemes) present in a speech
recognition system. The parameter $\lambda$ is present in Equation 46 because without it the
probabilities are too sharp. To correct for this problem, the raw $\xi$ values are raised to a
fractional power less than one, which reduces the dynamic range of the entire set. The
best value of $\lambda$ is experimentally determined, by determining the true error rate for some
joint corpus (C, A), and then adjusting $\lambda$ to make the synthetic word error rate $S_A(P, C)$
(see below) match the true value. Experimentally, 0.86 is a good value for $\lambda$.
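Because the raw values are exceedingly small, a practical implementation would likely carry them as logarithms. The following Python sketch is one hedged way to do so: it raises hypothetical raw confusabilities to the power $\lambda$ and normalizes them over a toy set of baseforms, working in the log domain throughout. The lexeme labels, the log values, and the choice to normalize over the evaluated lexeme are assumptions made for the example.

```python
import math


def renormalize(log_xi, lam=0.86):
    """Map raw log-confusabilities to probabilities proportional to xi ** lam.

    log_xi maps each baseform l(w) in the set B to log xi(l(w) | l(x) h)
    for one fixed synthesizing lexeme l(x) and history h.
    """
    scaled = {lexeme: lam * value for lexeme, value in log_xi.items()}
    peak = max(scaled.values())   # subtract the peak so exp() cannot underflow to 0 everywhere
    weights = {lexeme: math.exp(value - peak) for lexeme, value in scaled.items()}
    norm = sum(weights.values())
    return {lexeme: weight / norm for lexeme, weight in weights.items()}


# Hypothetical raw values; exp(-700) is already near the limit of double precision.
LOG_XI = {"cat": -700.0, "cap": -705.0, "dog": -740.0}
probs = renormalize(LOG_XI)
```

The fractional exponent compresses the dynamic range: a raw log-ratio of 5 between two lexemes becomes a log-ratio of 4.3 after scaling by 0.86.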



(4) DETERMINATION OF ACOUSTIC PERPLEXITY

Once confusability has been determined for a number of word pairs, the
acoustic perplexity is determined. As previously discussed, acoustic perplexity may be
determined through the following equation, repeated here for convenience:

$$Y_A(P_\theta, C, A) = P_\theta(C \mid A)^{-1/|C|}, \tag{2}$$

where $P_\theta(C \mid A)$ may be determined through any of a number of equations such as
Equations 4 through 6.

It can be shown that under certain conditions acoustic perplexity and
lexical perplexity (see Equation 1) are equal. See Printz et al., "Theory and Practice of
Acoustic Confusability," which has been previously discussed and which also discloses
additional properties of acoustic perplexity. In Printz, it is also shown that acoustic
perplexity is a better predictor of the quality of a language model than is lexical
perplexity. Acoustic perplexity may also be used during language model generation, and
should provide an improvement over current language model generation techniques.
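For illustration only, the Python sketch below evaluates acoustic perplexity from a list of per-word posteriors $p(w_i \mid a(w_i)\, h_i)$. It assumes that $P_\theta(C \mid A)$ factors into the product of those posteriors, which is one of several possible forms and is an assumption of this example; the function name and the posterior values are likewise hypothetical.

```python
import math


def acoustic_perplexity(posteriors):
    """Return P(C | A) ** (-1 / |C|).

    P(C | A) is taken here to be the product of the per-word posteriors
    (an assumed factorization); the product is formed in the log domain.
    """
    log_prob = sum(math.log(p) for p in posteriors)
    return math.exp(-log_prob / len(posteriors))


# Hypothetical posteriors for a four-word corpus.
POSTERIORS = [0.5, 0.25, 0.5, 0.25]
perplexity = acoustic_perplexity(POSTERIORS)
```

Here the product of the posteriors is $2^{-6}$, so the perplexity is $2^{6/4} = 2^{1.5}$, roughly 2.83.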

(5) DETERMINATION OF SAWER

Synthetic acoustic word error rate (SAWER) is even better than acoustic
perplexity in predicting the quality of a language model. SAWER is expressed as follows:

$$S_A(P, C) = \frac{1}{|C|} \sum_{i=1}^{|C|} \bigl( 1 - p(w_i \mid a(w_i)\, h_i) \bigr). \tag{47}$$

SAWER, which is $S_A(P, C)$, may be used not only to predict the quality of a language
model but also in language model generation. In Printz, it is shown that SAWER is a
better predictor of the quality of a language model than are lexical perplexity and acoustic
perplexity.
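A minimal Python sketch of this average follows. The function name and the sample posteriors are hypothetical, and the posteriors $p(w_i \mid a(w_i)\, h_i)$ are assumed to have been computed already.

```python
def sawer(posteriors):
    """Return (1 / |C|) * sum_i (1 - p(w_i | a(w_i) h_i))."""
    return sum(1.0 - p for p in posteriors) / len(posteriors)


# Hypothetical posteriors for a four-word corpus.
SAMPLE_POSTERIORS = [0.5, 0.25, 0.5, 0.25]
error_rate = sawer(SAMPLE_POSTERIORS)
```

Each term is the probability mass assigned to words other than the correct one, so the result reads as an expected fraction of misrecognized words.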

(6) MULTIPLE LEXEMES

So far a closed-form analytic expression for $p(l(w) \mid l(x)\, h)$ has been
developed; by Equation 11 the results may be combined for the various $l(x) \in x$ to yield
$p(l(w) \mid x\, h)$. However, word w itself may admit several pronunciations.

To account for these multiple pronunciations, assume that a(w) is a set
comprised of all the $l(w) \in w$, and furthermore treat the various l(w) as non-overlapping.
It then follows that

$$p(a(w) \mid x\, h) = \sum_{l(w) \in w} p(l(w) \mid x\, h). \tag{48}$$

This expression treats the differing pronunciations l(w) of w as non-overlapping events in
acoustic space. An alternative law for combining these probabilities is via

$$p(a(w) \mid x\, h) = \prod_{l(w) \in w} p(l(w) \mid x\, h)^{\,p(l(w) \mid w\, h)}. \tag{49}$$

This is the likelihood of a synthesized corpus of many pronunciations of
the word w, with the relative frequencies of the different lexemes taken as proportional to
their prior probabilities, $p(l(w) \mid w\, h)$.
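The two combination laws can be compared on a toy case. In the Python sketch below, which is illustrative only, a word has two hypothetical pronunciations with made-up probabilities and equal priors; the function names and the phone strings are assumptions of the example.

```python
import math


def combine_sum(lexeme_probs):
    """Additive law: pronunciations treated as non-overlapping events."""
    return sum(lexeme_probs.values())


def combine_geometric(lexeme_probs, priors):
    """Product law: each p(l(w) | x h) raised to its prior p(l(w) | w h)."""
    log_total = sum(priors[lexeme] * math.log(p)
                    for lexeme, p in lexeme_probs.items())
    return math.exp(log_total)


# Hypothetical values of p(l(w) | x h) for two pronunciations of one word.
LEXEME_PROBS = {"t ah m ey t ow": 0.04, "t ah m aa t ow": 0.01}
# Hypothetical priors p(l(w) | w h); they sum to one.
PRIORS = {"t ah m ey t ow": 0.5, "t ah m aa t ow": 0.5}

additive = combine_sum(LEXEME_PROBS)
geometric = combine_geometric(LEXEME_PROBS, PRIORS)
```

With equal priors the product law reduces to the geometric mean of the two values (0.02 here), whereas the additive law gives their sum (0.05).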

(7) CONTINUOUS SPEECH

The method developed so far applies only to the recognition of discrete
speech. In this section it is shown how it extends naturally to continuous speech, with
minor modifications.

For the continuous case, an expression for $p(a(h\,w\,r) \mid x\, h)$, where h and r
are the left and right contexts respectively, is sought.

The following equation is useful in the development of an expression for
$p(a(h\,w\,r) \mid x\, h)$. Let $X, X_1, X_2 \subset \Omega_X$, with $X = X_1 \cup X_2$, $X_1 \cap X_2 = \emptyset$, and $Y \subset \Omega_Y$.
Provided that $P(X_1) > 0$ and $P(X_2) > 0$, then

$$P(Y \mid X) = P(Y \mid X_1)\, P(X_1 \mid X) + P(Y \mid X_2)\, P(X_2 \mid X). \tag{50}$$

This is Lemma 1 in Printz; the proof follows from elementary probability theory. Using
Equation 93 A, then $p(a(h\,w\,r) \mid x\, h)$ becomes

$$p(a(h\,w\,r) \mid x\, h) = \sum_{l(x) \in x} p(a(h\,w\,r) \mid l(x)\, h) \cdot p(l(x) \mid x\, h), \tag{51}$$

and thus it suffices to develop an expression for $p(a(h\,w\,r) \mid l(x)\, h)$. Next, an argument is
developed to show that in the Bayes inversion formula, Equation 7, repeated here for
convenience,

$$p_\theta(w \mid h\, a(h\,w\,r)) = \frac{p(a(h\,w\,r) \mid w\, h)\, p_\theta(w \mid h)}{\sum_{x \in V} p(a(h\,w\,r) \mid x\, h)\, p_\theta(x \mid h)}, \tag{7}$$

$p(a(h\,w\,r) \mid l(x)\, h)$ may be replaced by $p(a(w) \mid l(x)\, h)$ as developed above.

To see this, begin by splitting the acoustic event a(h w r) as a(h) a(w) a(r),
making no declaration about how the division boundaries are chosen. Then under the
assumption of frame-to-frame independence, the following may be written:

$$p(a(h\,w\,r) \mid l(x)\, h) = p(a(h) \mid l(x)\, h) \cdot p(a(w) \mid l(x)\, h) \cdot p(a(r) \mid l(x)\, h). \tag{52}$$

The first and last factors on the right hand side may now be examined.
First note that the quantity $p(a(h) \mid l(x)\, h)$ is essentially a giant acoustic encoding
probability, with conditioning information l(x) tacked on. It will now be shown that this
quantity is effectively independent of x. Nominally

$$p(a(h) \mid l(x)\, h) = \sum_{l(h) \in h} p(a(h) \mid l(h)\, l(x))\, p(l(h) \mid h\, l(x)), \tag{53}$$

where the sum proceeds over all distinct pronunciations of h. Of course

$$p(l(h) \mid h\, l(x)) \approx p(l(h) \mid h) \tag{54}$$

to a good approximation. Moreover for any long history h each summand
$p(a(h) \mid l(h)\, l(x))$ is dominated by the computation associated with l(h), and hence
largely independent of l(x). But for a short history $a(h\,w\,r) \approx a(w\,r)$, and so the factor
under discussion is approximated by 1 in Equation 52. Thus, it has been established that
$p(a(h) \mid l(x)\, h) \approx p(a(h) \mid h) = \zeta_h$, where this defines $\zeta_h$.

Now consider $p(a(r) \mid l(x)\, h)$. Since a(r) and a(h) are separated by the
acoustic event a(w), they are decorrelated as signals. The degree to which this holds of
course depends upon a(w). Thus, the following is written: $p(a(r) \mid l(x)\, h) \approx p(a(r) \mid l(x))$.
Now it is argued as above that $p(a(r) \mid l(x))$ is either independent of l(x) or should be
replaced by 1. Either way, the following is written: $p(a(r) \mid l(x)) \approx p(a(r)) = \omega_r$, defining
$\omega_r$.

Substituting in Equation 52 above, the following may be written:

$$p(a(h\,w\,r) \mid l(x)\, h) \approx \zeta_h \cdot p(a(w) \mid l(x)\, h) \cdot \omega_r, \tag{55}$$

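where $\zeta_h$ and $\omega_r$ are both independent of x. Carrying this result into the Bayes formula,
Equation 7, the following is obtained:

$$\begin{aligned}
p(w \mid a(h\,w\,r)\, h)
&= \frac{p(a(h\,w\,r) \mid w\, h)\, p(w \mid h)}{\sum_{x \in V} p(a(h\,w\,r) \mid x\, h)\, p(x \mid h)} && \text{(56)} \\
&= \frac{\sum_{l(w) \in w} p(a(h\,w\,r) \mid l(w)\, h)\, p(l(w) \mid w\, h)\, p(w \mid h)}{\sum_{x \in V} \sum_{l(x) \in x} p(a(h\,w\,r) \mid l(x)\, h)\, p(l(x) \mid x\, h)\, p(x \mid h)} && \text{(57)} \\
&\approx \frac{\sum_{l(w) \in w} \zeta_h\, p(a(w) \mid l(w)\, h)\, \omega_r\, p(l(w) \mid w\, h)\, p(w \mid h)}{\sum_{x \in V} \sum_{l(x) \in x} \zeta_h\, p(a(w) \mid l(x)\, h)\, \omega_r\, p(l(x) \mid x\, h)\, p(x \mid h)} && \text{(58)} \\
&= \frac{p(a(w) \mid w\, h)\, p(w \mid h)}{\sum_{x \in V} p(a(w) \mid x\, h)\, p(x \mid h)}, && \text{(59)}
\end{aligned}$$

just as for the case of discrete speech. Thus, to a reasonable approximation,
$p(w \mid a(h\,w\,r)\, h)$ is the same for both continuous and discrete speech.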

However, spoken language is a physical phenomenon, generated by
movements of bone, flesh, muscle and air. These are physical entities with non-zero mass
and hence non-vanishing inertia. The speech apparatus therefore cannot move
instantaneously from one conformation to another, and hence successive frames of speech
are necessarily correlated.

Because of this, Equation 59 may be modified by adjoining to a(w) some
small portions $a(h_\epsilon)$ and $a(r_\epsilon)$ of the full left and right acoustic context, perhaps a phone
or two to the left and right respectively. Thus, Equation 59 is modified to

$$p(w \mid a(h\,w\,r)\, h) \approx \frac{p(a(h_\epsilon\, w\, r_\epsilon) \mid w\, h)\, p(w \mid h)}{\sum_{x \in V} p(a(h_\epsilon\, w\, r_\epsilon) \mid x\, h)\, p(x \mid h)}. \tag{60}$$

This corresponds to extending, with appropriate arcs and transitions, the
synthesizer model when building the product machine of FIGS. 4 through 6.

(8) TRAINING OF LANGUAGE MODELS 

Two measures have been defined, which are acoustic perplexity and
synthetic acoustic word error rate. These two measures are superior to lexical perplexity
in measuring the quality of a language model. Lexical perplexity is further used, not only
to evaluate the strength of a language model, but also in deciding upon parameter values
describing the language model probabilities. When determining lexical perplexity, the
individual probabilities $p(w \mid h)$ may be decided upon by minimizing the lexical
perplexity of a training corpus, $Y_L(p, C) = p(C)^{-1/|C|}$. The optimum may in this case be
computed analytically and is given explicitly as the relative frequency

$$p(w \mid h) = \frac{c(w, h)}{\sum_{v \in V} c(v, h)}, \tag{61}$$

where c(w, h) is the number of times word w follows history h in the training corpus.
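For concreteness, a relative-frequency estimate of this kind may be sketched in Python as follows. The function name and the toy corpus of (history, word) pairs are hypothetical and are offered only as an illustration of the counting involved.

```python
from collections import Counter


def relative_frequency_model(corpus):
    """Estimate p(w | h) as count(h, w) / count(h).

    corpus is a list of (history, word) pairs; the result maps each
    observed (history, word) pair to its relative frequency.
    """
    pair_counts = Counter(corpus)
    history_counts = Counter(history for history, _ in corpus)
    return {(history, word): count / history_counts[history]
            for (history, word), count in pair_counts.items()}


# Hypothetical training corpus with single-word histories.
CORPUS = [("the", "cat"), ("the", "dog"), ("the", "cat"), ("a", "cat")]
MODEL = relative_frequency_model(CORPUS)
```

In this toy corpus "cat" follows "the" in two of three occurrences, so the estimate for that pair is 2/3.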



A pursuit of a minimizing value for the acoustic perplexity or the synthetic
acoustic word error rate would yield language models with improved speech recognition
performance. However, the situation for acoustic perplexity and synthetic acoustic word
error rate is not as simple as for lexical perplexity. There is no longer an explicit
analytic expression for the global minimum for either of these two measures. A numerical
apparatus needs to be developed to aid the search for a global (or local) minimum.

Consider the simple case where all words have a unique pronunciation.
Then by Equation 28, the following occurs:

$$p(w \mid h\, a(w)) = \frac{p(a(w) \mid w\, h)\, p(w \mid h)}{\sum_{x \in V} p(a(w) \mid x\, h)\, p(x \mid h)}.$$



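The language models themselves are being sought and the following may
be written: $\lambda_{wh} := p(w \mid h)$. The acoustic perplexity of a corpus C is
$Y_A(\lambda, C, A) = P_\lambda(C \mid A)^{-1/|C|}$. Minimization of this quantity is equivalent to maximizing
the quantity $-\log Y_A(\lambda, C, A)$. Thus,

$$-\log Y_A(\lambda, C, A) = \frac{1}{|C|} \sum_{i=1}^{|C|} \log \frac{p(a(w_i) \mid w_i h_i)\, \lambda_{w_i h_i}}{\sum_{x \in V} p(a(w_i) \mid x\, h_i)\, \lambda_{x h_i}}
= \frac{1}{|C|} \sum_{h} -\log Y_A(\lambda, h, C, A),$$

where $c(w, h) := |\{ i : (w_i, h_i) = (w, h),\ i = 1, \ldots, |C| \}|$ and

$$-\log Y_A(\lambda, h, C, A) := \sum_{w \in V} c(w, h) \log \frac{p(a(w) \mid w\, h)\, \lambda_{wh}}{\sum_{x \in V} p(a(w) \mid x\, h)\, \lambda_{xh}}. \tag{63}$$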



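Similarly, minimizing the synthetic word error rate, $S_A(\lambda, C, A)$, is
equivalent to maximizing

$$1 - S_A(\lambda, C, A) = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{p(a(w_i) \mid w_i h_i)\, \lambda_{w_i h_i}}{\sum_{x \in V} p(a(w_i) \mid x\, h_i)\, \lambda_{x h_i}}
= \frac{1}{|C|} \sum_{h} \bigl( 1 - S_A(\lambda, h, C, A) \bigr),$$

where

$$1 - S_A(\lambda, h, C, A) := \sum_{w \in V} c(w, h)\, \frac{p(a(w) \mid w\, h)\, \lambda_{wh}}{\sum_{x \in V} p(a(w) \mid x\, h)\, \lambda_{xh}}. \tag{64}$$

Because the equations $-\log Y_A(\lambda, C, A) = \frac{1}{|C|} \sum_h -\log Y_A(\lambda, h, C, A)$
and $1 - S_A(\lambda, C, A) = \frac{1}{|C|} \sum_h (1 - S_A(\lambda, h, C, A))$ have been developed, it is clear that $\lambda_{wh}$
can be found by maximizing $-\log Y_A(\lambda, h, C, A)$ and $1 - S_A(\lambda, h, C, A)$ for each value of h
separately. Consider two slight generalizations of the functions for which further theory
will be developed.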

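Define the functions

$$a(\lambda) = \sum_{i=1}^{N} c_i \log \frac{\sum_{j=1}^{M} a_{ij} \lambda_j}{\sum_{j=1}^{M} b_{ij} \lambda_j} \tag{65}$$

and

$$s(\lambda) = \sum_{i=1}^{N} c_i\, \frac{\sum_{j=1}^{M} a_{ij} \lambda_j}{\sum_{j=1}^{M} b_{ij} \lambda_j}. \tag{66}$$

Using the matrix notation $A = \{a_{ij}\}$ and $B = \{b_{ij}\}$, with $i = 1, \ldots, N$ and $j = 1, \ldots, M$, the following
may be written: $(A\lambda)_i$ for $\sum_j a_{ij} \lambda_j$ and $(B\lambda)_i$ for $\sum_j b_{ij} \lambda_j$. Equations 65 and 66
then become

$$a(\lambda) = \sum_{i=1}^{N} c_i \log \frac{(A\lambda)_i}{(B\lambda)_i} \tag{67}$$

and

$$s(\lambda) = \sum_{i=1}^{N} c_i\, \frac{(A\lambda)_i}{(B\lambda)_i} \tag{68}$$

in this simplified notation. Therefore $a(\lambda) = -\log Y_A(\lambda, h, C, A)$ and $s(\lambda) = 1 - S_A(\lambda, h, C, A)$ when
$N = M = |V|$,

$$a_{wx} = \begin{cases} p(a(w) \mid w\, h) & \text{if } x = w \\ 0 & \text{otherwise,} \end{cases}$$

$b_{wx} = p(a(w) \mid x\, h)$ and $c_w = c(w, h)$.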
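The two objective functions can be evaluated directly from these definitions. The Python sketch below is illustrative only; it stores A by its diagonal and B as a dense list of rows, and the 2-word vocabulary, the probabilities, and the counts are hypothetical values chosen for the example.

```python
import math


def objectives(a_diag, b, c, lam):
    """Return (a(lambda), s(lambda)) for a diagonal A and a dense B.

    a_diag[i] is the diagonal entry a_ii, b is a list of rows of B,
    c holds the counts c_i and lam the language model probabilities.
    """
    a_val = 0.0
    s_val = 0.0
    for i, count in enumerate(c):
        numerator = a_diag[i] * lam[i]                                    # (A lambda)_i
        denominator = sum(b[i][j] * lam[j] for j in range(len(lam)))     # (B lambda)_i
        ratio = numerator / denominator
        a_val += count * math.log(ratio)
        s_val += count * ratio
    return a_val, s_val


# Hypothetical 2-word vocabulary: B[w][x] plays the role of p(a(w) | x h).
B = [[0.8, 0.2],
     [0.1, 0.9]]
A_DIAG = [B[0][0], B[1][1]]   # a_ww = p(a(w) | w h)
C = [3.0, 1.0]                # counts c(w, h)
LAM = [0.5, 0.5]
a_value, s_value = objectives(A_DIAG, B, C, LAM)
```

Both functions depend only on ratios of linear forms in $\lambda$, so multiplying $\lambda$ by any positive constant leaves them unchanged; the example can be rerun with `[1.0, 1.0]` to observe this.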

(8)(a) Steepest Descent Methods 

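The objective now is to maximize $a(\lambda)$ and $s(\lambda)$ for values of $\lambda$ satisfying
$\sum_{i=1}^{M} \lambda_i = 1$. Noting that these two functions satisfy the functional relations $a(c\lambda) = a(\lambda)$ and
$s(c\lambda) = s(\lambda)$ for any $c > 0$, and letting $f : \mathbb{R}^M \to \mathbb{R}$ be a $C^1(\mathbb{R}^M)$ function satisfying
$f(c\lambda) = f(\lambda)$ for all $c > 0$, then the following holds:

$$\sum_{i=1}^{M} \lambda_i \frac{\partial f}{\partial \lambda_i}(\lambda) = 0 \quad \forall \lambda \in \mathbb{R}^M. \tag{69}$$

A proof of Equation 69 is given in Printz. Equation 69 provides a method to locate
incrementally larger values for $f(\lambda)$. The following theorem, referred to herein as Theorem
1, specifies a direction that guarantees finding a better value unless a function is being
maximized at a boundary point or a point where $\nabla f(\lambda) = 0$.

Theorem 1: Let $f : \mathbb{R}^M \to \mathbb{R}$ be a $C^1(\mathbb{R}^M)$ function such that $f(c\lambda) = f(\lambda)$,
$\sum_{i=1}^{M} \lambda_i = 1$, $\lambda_i \ge 0$ for all $i = 1, \ldots, M$ and both $\frac{\partial f}{\partial \lambda_i}(\lambda) \ne 0$ and $0 < \lambda_i < 1$ hold for some
$i = 1, \ldots, M$. Define $\hat{\lambda}_i := \lambda_i + \epsilon \lambda_i \frac{\partial f}{\partial \lambda_i}(\lambda)$. Then there exists a sufficiently small $\epsilon$ such that
the following three properties hold:

$$\sum_{i=1}^{M} \hat{\lambda}_i = 1, \tag{70}$$

$$\hat{\lambda}_i \ge 0 \quad \forall i = 1, \ldots, M, \tag{71}$$

and

$$f(\hat{\lambda}) > f(\lambda). \tag{72}$$

A proof of Theorem 1 is given in Printz.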
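The update direction of Theorem 1 can be exercised on a small case. The Python sketch below is illustrative only: it takes f to be the logarithmic objective with A equal to the diagonal of B, for which the partial derivatives have the closed form noted in the comments, and applies one step with a hypothetical step size. The function names, the matrix, the counts, and the step size are assumptions of the example.

```python
import math


def log_objective(b, c, lam):
    """a(lambda) with A = diag(B): sum_i c_i * log(b_ii lam_i / (B lam)_i)."""
    total = 0.0
    for i, count in enumerate(c):
        denom = sum(b[i][j] * lam[j] for j in range(len(lam)))
        total += count * math.log(b[i][i] * lam[i] / denom)
    return total


def gradient(b, c, lam):
    """da/dlam_k = c_k / lam_k - sum_i c_i * b_ik / (B lam)_i."""
    m = len(lam)
    denoms = [sum(b[i][j] * lam[j] for j in range(m)) for i in range(m)]
    return [c[k] / lam[k] - sum(c[i] * b[i][k] / denoms[i] for i in range(m))
            for k in range(m)]


def growth_step(b, c, lam, eps):
    """Return lam_hat with lam_hat_i = lam_i + eps * lam_i * da/dlam_i."""
    grad = gradient(b, c, lam)
    return [value + eps * value * g for value, g in zip(lam, grad)]


# Hypothetical data: 2-word vocabulary, counts 3 and 1, uniform starting point.
B = [[0.8, 0.2],
     [0.1, 0.9]]
C = [3.0, 1.0]
LAM = [0.5, 0.5]
LAM_HAT = growth_step(B, C, LAM, eps=0.01)
```

For these values the gradient is (1, -1), so the step moves mass from the second word to the first, giving (0.505, 0.495); the components still sum to one and the objective increases, as the three properties of the theorem require.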

Theorem 1 only explains that some value of $\epsilon > 0$ exists, but the theorem
does not provide a concrete value of $\epsilon$ for which the theorem holds. Such a value may in fact
be found using theory developed in Gopalakrishnan, "An Inequality for Rational
Functions with Applications to some Statistical Estimation Problems," IEEE Transactions
on Information Theory, 37(1), pp. 107-113, 1991, the disclosure of which is incorporated
herein by reference. This concrete value, however, has no practical use, as it is far too
small to yield an efficient update rule.



(8)(b) A Numerical Optimization Strategy 

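Theorem 1 guarantees that a better value will be found by choosing
$\hat{\lambda}_i = \lambda_i + \epsilon \lambda_i \frac{\partial f}{\partial \lambda_i}(\lambda)$, $i = 1, 2, \ldots, m$, for some value $\epsilon > 0$ unless the function is at a local
stationary point. To satisfy the constraint $0 \le \hat{\lambda}_i \le 1$ for $i = 1, 2, \ldots, m$ it is also required that
$\epsilon \le \epsilon_{\max}$, where

$$\epsilon_{\max} = \max \left\{ \epsilon : 0 \le \lambda_i + \epsilon \lambda_i \frac{\partial f}{\partial \lambda_i}(\lambda) \le 1,\ i = 1, 2, \ldots, m \right\} \tag{73}$$

if $\lambda_i \frac{\partial f}{\partial \lambda_i}(\lambda) \ne 0$ for some $i = 1, 2, \ldots, m$ and $\epsilon_{\max} = 0$ otherwise.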
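The largest admissible step follows component by component from the two inequalities. The following Python sketch, given only by way of example, computes it for a supplied point and gradient; the function name and the sample vectors are hypothetical.

```python
def epsilon_max(lam, grad):
    """Largest eps with 0 <= lam_i + eps * lam_i * grad_i <= 1 for every i.

    Returns 0.0 when lam_i * grad_i vanishes for all i.
    """
    bounds = []
    for value, g in zip(lam, grad):
        step = value * g
        if step > 0.0:
            bounds.append((1.0 - value) / step)   # component may not exceed 1
        elif step < 0.0:
            bounds.append(-value / step)          # component may not drop below 0
    return min(bounds) if bounds else 0.0


# Hypothetical point on the simplex and gradient at that point.
eps_limit = epsilon_max([0.5, 0.5], [1.0, -1.0])
```

Components with a zero product impose no restriction, which is why they are skipped rather than contributing a bound.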

To optimize $a(\lambda)$ and $s(\lambda)$ numerically, a standard line maximization
routine is used to find a local maximum $\epsilon_{opt}$ for $a(\hat{\lambda})$ and $s(\hat{\lambda})$. The procedure is then
iteratively repeated, making $\hat{\lambda}$ be the new value of $\lambda$ at each consecutive iteration.
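One such line maximization may be sketched with a golden section search restricted to a closed interval, here standing in for the interval from 0 to the largest admissible step. The Python code below is a textbook form of the search, not the modified routine used in the experiments described in this section; the function name, the tolerance, and the test function are assumptions of the example.

```python
import math


def golden_section_maximize(func, lo, hi, tol=1e-8):
    """Locate a local maximizer of func on [lo, hi] by golden section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0        # about 0.618
    x1 = hi - inv_phi * (hi - lo)
    x2 = lo + inv_phi * (hi - lo)
    f1, f2 = func(x1), func(x2)
    while hi - lo > tol:
        if f1 < f2:
            # The maximum lies in [x1, hi]; the old x2 becomes the new x1.
            lo, x1, f1 = x1, x2, f2
            x2 = lo + inv_phi * (hi - lo)
            f2 = func(x2)
        else:
            # The maximum lies in [lo, x2]; the old x1 becomes the new x2.
            hi, x2, f2 = x2, x1, f1
            x1 = hi - inv_phi * (hi - lo)
            f1 = func(x1)
    return 0.5 * (lo + hi)


# Hypothetical unimodal test function with its maximum at 0.3.
best = golden_section_maximize(lambda eps: -(eps - 0.3) ** 2, 0.0, 1.0)
```

Each iteration reuses one of the two interior evaluations, so only one new function value is needed per step, which matters when each evaluation is expensive.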

In numerical experiments, the golden section search method as well as
Brent's method using derivatives were evaluated. These algorithms are described in Press
et al., "Numerical Recipes in C: The Art of Scientific Computing," Cambridge University
Press, second edition, 1999, the disclosure of which is hereby incorporated by reference.
Slightly modified versions of the implementations in Press of these two algorithms were
used. As the performance of the two methods was very close, the more robust golden
section search method was chosen to be used in more extensive experiments. Also, the
routine for initially bracketing a minimum described in Press was modified to account for
the additional knowledge $0 \le \epsilon \le \epsilon_{\max}$. A procedure to follow is to iterate the line
maximization procedure until a satisfactory accuracy for $\lambda$ is achieved or until a very
large number of iterations have been made.

(9) SPEED ENHANCEMENTS WHEN CALCULATING CONFUSABILITY

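Both computations of $a(\lambda)$ and $s(\lambda)$ require the computation of the
matrix-vector products $A\lambda$ and $B\lambda$. The matrices A and B have dimensions $N \times M$, which
for a typical speech recognition vocabulary would be 70,000 x 70,000. In other words A
and B have $70{,}000^2 = 4.9 \times 10^9$ elements. In reality A is stored as a diagonal matrix and B
as a sparse matrix containing approximately $10^8$ elements. This means that the
computation of $a(\lambda)$ and $s(\lambda)$ each requires at least $10^8$ multiplications. Repeated
evaluation of $a(\lambda)$ and $s(\lambda)$ is going to be very costly and optimizing these functions will
be infeasible unless there are computational savings when optimizing with respect to $\epsilon$.
The most important saving comes from the observation that

$$a(\lambda + \epsilon v) = \sum_{i=1}^{N} c_i \log \frac{(A\lambda)_i + \epsilon (Av)_i}{(B\lambda)_i + \epsilon (Bv)_i} \tag{74}$$

and

$$s(\lambda + \epsilon v) = \sum_{i=1}^{N} c_i\, \frac{(A\lambda)_i + \epsilon (Av)_i}{(B\lambda)_i + \epsilon (Bv)_i}. \tag{75}$$

In particular the choice $v = (v_1, v_2, \ldots, v_M)$, where $v_i = \lambda_i \frac{\partial a}{\partial \lambda_i}(\lambda)$ or $v_i = \lambda_i \frac{\partial s}{\partial \lambda_i}(\lambda)$, depending on
which function is being maximized, corresponds to maximizing $a(\hat{\lambda})$ or $s(\hat{\lambda})$ with
respect to $\epsilon \ge 0$.

If $\alpha = A\lambda$, $\beta = Av$, $\gamma = B\lambda$ and $\delta = Bv$ are precomputed then the cost of
evaluating $s(\hat{\lambda})$ for a particular value of $\epsilon$ is N divisions, 3N additions and 3N
multiplications. For $a(\hat{\lambda})$ the cost is an additional N logarithm extractions.

Rewriting the formulas in terms of $\alpha$, $\beta$, $\gamma$ and $\delta$, the expressions become

$$a(\hat{\lambda}) = \sum_{i=1}^{N} c_i \log \frac{\alpha_i + \epsilon \beta_i}{\gamma_i + \epsilon \delta_i} \tag{76}$$

and

$$s(\hat{\lambda}) = \sum_{i=1}^{N} c_i\, \frac{\alpha_i + \epsilon \beta_i}{\gamma_i + \epsilon \delta_i}. \tag{77}$$

For Equation 76, this can be further simplified to

$$a(\hat{\lambda}) = \sum_{i=1}^{N} c_i \log \frac{\alpha_i}{\gamma_i} + \sum_{i=1}^{N} c_i \log \frac{1 + \epsilon e_i}{1 + \epsilon d_i}
= a(\lambda) + \sum_{i=1}^{N} c_i \log \frac{1 + \epsilon e_i}{1 + \epsilon d_i}, \tag{78}$$

where $e_i = \beta_i / \alpha_i$ and $d_i = \delta_i / \gamma_i$ for $i = 1, 2, \ldots, N$. If $e = (e_1, e_2, \ldots, e_N)$ and
$d = (d_1, d_2, \ldots, d_N)$ are precomputed, the cost of evaluating $a(\hat{\lambda})$ according to Equation
78 is actually 1 addition more than according to Equation 74, but the argument of the
logarithm is now close to 1 for small values of $\epsilon$. This ensures an added degree of
numerical stability that is well worth the additional computational cost.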
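The saving can be checked on a small case. In the Python sketch below, offered only as an illustration, the four precomputed vectors are hypothetical values for a 2-word vocabulary, and the fast evaluation along the search line is compared with direct evaluation of the ratio inside the logarithm. The function names and all numbers are assumptions of the example.

```python
import math


def precompute_log_terms(alpha, beta, gamma, delta):
    """Return e_i = beta_i / alpha_i and d_i = delta_i / gamma_i."""
    e = [b / a for a, b in zip(alpha, beta)]
    d = [dl / g for g, dl in zip(gamma, delta)]
    return e, d


def a_along_line(a_at_lambda, c, e, d, eps):
    """a(lambda) + sum_i c_i * log((1 + eps * e_i) / (1 + eps * d_i))."""
    return a_at_lambda + sum(
        ci * math.log((1.0 + eps * ei) / (1.0 + eps * di))
        for ci, ei, di in zip(c, e, d))


def a_direct(c, alpha, beta, gamma, delta, eps):
    """sum_i c_i * log((alpha_i + eps * beta_i) / (gamma_i + eps * delta_i))."""
    return sum(ci * math.log((a + eps * b) / (g + eps * dl))
               for ci, a, b, g, dl in zip(c, alpha, beta, gamma, delta))


# Hypothetical precomputed products for a 2-word vocabulary.
ALPHA = [0.4, 0.45]    # A lambda
BETA = [0.4, -0.45]    # A v
GAMMA = [0.5, 0.5]     # B lambda
DELTA = [0.3, -0.4]    # B v
C = [3.0, 1.0]

E, D = precompute_log_terms(ALPHA, BETA, GAMMA, DELTA)
A_AT_LAMBDA = a_direct(C, ALPHA, BETA, GAMMA, DELTA, 0.0)
fast = a_along_line(A_AT_LAMBDA, C, E, D, 0.25)
slow = a_direct(C, ALPHA, BETA, GAMMA, DELTA, 0.25)
```

Once `E`, `D` and the starting value are in hand, each new trial step costs only a pass over vectors of length N and never touches the matrix B.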

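For Equation 77, the following illustrates additional computational savings:

$$s(\hat{\lambda}) = \sum_{i=1}^{N} c_i \frac{\alpha_i}{\gamma_i} + \epsilon \sum_{i=1}^{N} c_i\, \frac{\beta_i \gamma_i - \alpha_i \delta_i}{\gamma_i (\gamma_i + \epsilon \delta_i)}
= s(\lambda) + \epsilon \sum_{i=1}^{N} \frac{f_i}{1 + \epsilon d_i}, \tag{79}$$

where $d_i = \delta_i / \gamma_i$ as before and $f_i = c_i \left( \frac{\beta_i \gamma_i - \alpha_i \delta_i}{\gamma_i^2} \right)$ for $i = 1, 2, \ldots, m$. After precomputing $\alpha$,
$\beta$, $\gamma$, $\delta$, $d = (d_1, d_2, \ldots, d_m)$, and $f = (f_1, f_2, \ldots, f_m)$, the cost of evaluating $s(\hat{\lambda})$ for a
particular value of $\epsilon$ according to Equation 79 is N divisions, N+1 multiplications and
2N+1 additions. This is a total of 4N+2 arithmetic operations. Evaluating $s(\lambda)$ for an
arbitrary value of $\lambda$ costs approximately $10^8$ arithmetic operations and evaluating $s(\hat{\lambda})$ for
a particular $\epsilon$ costs approximately $4N = 4 \times 70{,}000 = 2.8 \times 10^5$ operations after all the
precomputation is performed. This means that $s(\hat{\lambda})$ can be evaluated for more than 350
different values of $\epsilon$ for the cost of one general function evaluation once $\alpha$, $\beta$, $\gamma$, $\delta$, d, e
and f have been precomputed. The precomputation of $\alpha$, $\beta$, $\gamma$, $\delta$, d, e and f costs roughly
as much as two general function evaluations of $s(\lambda)$. But this precomputation cost can be
cut in half by the observation that once a best choice of $\epsilon$, i.e. $\epsilon_{opt}$, is found, the next
values for $\alpha$ and $\gamma$ may be computed by the formula

$$\alpha_{new} = \alpha_{old} + \epsilon_{opt}\, \beta_{old}, \qquad \gamma_{new} = \gamma_{old} + \epsilon_{opt}\, \delta_{old}. \tag{80}$$

The only costly precomputation step left is then to recompute $\beta$ and $\delta$ for
each new choice of $\lambda$.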

In an efficient implementation of the optimization of $s(\lambda)$, the best $\epsilon$ can be
determined for the cost of only slightly more than one general function evaluation per
consecutive line maximization if m and N are as large as 70,000.
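A hedged Python sketch of this bookkeeping follows. It precomputes the per-component terms once, evaluates the function along the search line from them, and then advances the two products that depend on the current point without a further matrix-vector multiplication. The function names and the numerical values are hypothetical and chosen only for the example.

```python
def precompute_s_terms(c, alpha, beta, gamma, delta):
    """Return d_i = delta_i / gamma_i and
    f_i = c_i * (beta_i * gamma_i - alpha_i * delta_i) / gamma_i ** 2."""
    d = [dl / g for g, dl in zip(gamma, delta)]
    f = [ci * (b * g - a * dl) / (g * g)
         for ci, a, b, g, dl in zip(c, alpha, beta, gamma, delta)]
    return d, f


def s_along_line(s_at_lambda, d, f, eps):
    """s(lambda) + eps * sum_i f_i / (1 + eps * d_i)."""
    return s_at_lambda + eps * sum(fi / (1.0 + eps * di) for fi, di in zip(f, d))


def s_direct(c, alpha, beta, gamma, delta, eps):
    """sum_i c_i * (alpha_i + eps * beta_i) / (gamma_i + eps * delta_i)."""
    return sum(ci * (a + eps * b) / (g + eps * dl)
               for ci, a, b, g, dl in zip(c, alpha, beta, gamma, delta))


def advance(alpha, beta, gamma, delta, eps_opt):
    """New A*lambda and B*lambda after the step lambda + eps_opt * v."""
    new_alpha = [a + eps_opt * b for a, b in zip(alpha, beta)]
    new_gamma = [g + eps_opt * dl for g, dl in zip(gamma, delta)]
    return new_alpha, new_gamma


# Hypothetical precomputed products for a 2-word vocabulary.
ALPHA, BETA = [0.4, 0.45], [0.4, -0.45]
GAMMA, DELTA = [0.5, 0.5], [0.3, -0.4]
C = [3.0, 1.0]

D, F = precompute_s_terms(C, ALPHA, BETA, GAMMA, DELTA)
S_AT_LAMBDA = s_direct(C, ALPHA, BETA, GAMMA, DELTA, 0.0)
fast = s_along_line(S_AT_LAMBDA, D, F, 0.25)
slow = s_direct(C, ALPHA, BETA, GAMMA, DELTA, 0.25)
NEW_ALPHA, NEW_GAMMA = advance(ALPHA, BETA, GAMMA, DELTA, 0.25)
```

After `advance`, only the two products involving the new search direction still require a pass over the sparse matrix, which is the halving of the precomputation cost described above.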

(9)(a) Efficient Algorithm for Computing Confusability

Returning to FIG. 6, this figure shows a probability flow matrix 600 that
has a sparsity structure that corresponds to the product machine shown in FIGS. 4 and 5.
As previously discussed, the confusability $\xi_{w|x}$ is defined by Equation 43, repeated here
for convenience:

$$\xi_{w|x} = u_F^\top (I - M)^{-1} u_I, \tag{43}$$

where M is the probability flow matrix. In this section the disclosure gives an efficient
algorithm for computing this quantity.

(9)(a)(1) Conditions Required to Apply the Algorithm

Two conditions must be satisfied to apply the algorithm. First, the
synthesizer and evaluation hidden Markov models (hereafter "HMMs"), used to construct
the product machine, must have so-called "left-to-right" state graphs. The state graph of
an HMM is left-to-right if it is acyclic (that is to say, contains no cycles) except possibly
for self-loops (that is to say, transitions from a node to itself). The terminology
"left-to-right" suggests that this idea has something to do with the way a state graph is
drawn on a page, but in fact its meaning is the purely topological one just given. The
HMMs in Figure 3 and Figure 4 are all left-to-right.

The HMMs that appear in speech recognition are almost always
left-to-right, thus the technique described here can be very widely applied. It should be
noted however that even if the underlying synthesizer and evaluator HMMs are not
left-to-right, precluding the use of the efficient algorithm that is described below, the
general method of computing confusabilities by Equation 43 above may still be applied.

Second, the method described here is efficient in part because the
maximum indegree (that is, the number of transitions or arcs impinging on any given
node) of the synthesizer and evaluation models is bounded above by a fixed number. The
reason for this efficiency is explained further below. For the particular HMMs considered
in FIG. 3 (and also FIG. 4), this bound is 2, and in the discussion that follows, the
technique will be explained as if this were always so. However, the method applies no
matter what the true value of this bound is, though the efficiency of the method may be
reduced somewhat.

Two important properties of the product machine follow from these
conditions. First, because the state graphs of the synthesizer and valuation models are
both left-to-right, so too is the state graph of the product machine that is formed from
these two models. As a result, the states of the product machine may be assigned
numbers, starting from 1 and proceeding sequentially through the number of states N of
the machine, in such a way that whenever there is an arc from state number r to state
number s, it follows that $r \le s$ (with equality only for a self-loop). Such an assignment will be called a "topological
numbering." In particular, this numbering may be determined in such a way that 1 is the
number of the initial state and N is the number of the final state.
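A topological numbering of this kind can be produced by a standard procedure that repeatedly numbers states whose predecessors have all been numbered. The Python sketch below is one such procedure, shown only as an example; the function name, the state labels, and the small product graph are hypothetical, and self-loops are ignored when ordering.

```python
def topological_numbering(states, arcs):
    """Number states 1..N so that every arc (r, s) with r != s runs from a
    lower number to a higher one. Self-loops are ignored.

    Raises ValueError if the graph contains a cycle other than a self-loop,
    that is, if it is not left-to-right.
    """
    successors = {state: [] for state in states}
    indegree = {state: 0 for state in states}
    for src, dst in arcs:
        if src != dst:
            successors[src].append(dst)
            indegree[dst] += 1
    ready = [state for state in states if indegree[state] == 0]
    numbering = {}
    while ready:
        state = ready.pop(0)
        numbering[state] = len(numbering) + 1
        for nxt in successors[state]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(numbering) != len(states):
        raise ValueError("state graph is not left-to-right")
    return numbering


# Hypothetical product of two 2-state left-to-right models.
STATES = ["w1x1", "w1x2", "w2x1", "w2x2"]
ARCS = [("w1x1", "w1x1"), ("w1x1", "w1x2"), ("w1x1", "w2x1"), ("w1x1", "w2x2"),
        ("w1x2", "w1x2"), ("w1x2", "w2x2"),
        ("w2x1", "w2x1"), ("w2x1", "w2x2"),
        ("w2x2", "w2x2")]
NUMBERING = topological_numbering(STATES, ARCS)
```

In this example the initial state is the only state without incoming arcs and the final state is the only one without outgoing arcs (other than self-loops), so they receive the numbers 1 and N respectively.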
Second, no state of the product machine has more than 4 arcs impinging
on it, including self-loops. This is a consequence of the bounded indegree of the
synthesizer and valuation models, whose product was taken to obtain the graph of the
product machine. In general, if the synthesizer model has maximum indegree $D_w$ and the
valuation model has maximum indegree $D_x$, then the maximum indegree of any state of
the product machine is $D_w \times D_x$. For instance, in the examples of FIG. 3 (and also FIG.
4), $D_w = D_x = 2$, and the product machine 430 of FIGS. 4 and 5 has maximum indegree
$D_w \times D_x = 4$.

The significance of this bound is as follows. It is evident from Figure 6
that not every possible state-to-state transition in the product machine 430 of FIGS. 4 and
5 is present. This means that only certain elements of the probability flow matrix M may
be non-zero. Indeed, the maximum number of non-zero entries in any column of M is the
maximum indegree of the product machine. Thus, carrying this example a little further,
the maximum number of non-zero entries in any column of M is $4 = D_w \times D_x$. As a result
the total number of non-zero entries in the entire matrix M is no greater than $D_w \times D_x \times N$,
where N is the total number of states in the product machine (and hence also the number
of rows or columns of the matrix M). This property will be made use of later.

(9)(a)(2) Detailed Description of the Algorithm

Return now to Equation 43 above. Recall that $u_I$ is a vector with the same
number of elements as there are states in the product machine, and with a value of 1 for
the element that corresponds to the initial state of the product machine, and a 0 for every
other element. (It is being assumed now that the states of the product machine, which are
formally named by pairs of states $(w_r, x_s)$, have been assigned numbers as well, in such a
way as to constitute a topological numbering.)

Likewise $u_F$ is a vector with a 1 for the element that corresponds to the
final state of the product machine, and a 0 everywhere else. Thus, Equation 43 selects a
single element of matrix $(I - M)^{-1}$, namely the one that lies in the matrix column that
corresponds to the initial state of the product machine, and in the row that corresponds to
the final state of the machine. For the example considered here, these are respectively the
first column (j = 1) and the last row (i = 9). Consequently, only one element of the matrix
$(I - M)^{-1}$ is needed to determine confusability. Because only this element is necessary to
determine confusability, certain simplifications may be made in the computation, which
are now explained.

To begin, note that in computing the quantity $\xi_{w|x}$ for several word pairs
w, x, the K-quantities for all pairs of densities are computed beforehand. Thus, only the
computation of the desired element of $(I - M)^{-1}$ is left. It will be shown that because only
one element of $(I - M)^{-1}$ is required, a significant simplification in the algorithm obtains,
as compared to computation of the complete matrix inverse.

To compute this element, as is known in the art, the inverse of a matrix
may be determined by a sequence of elementary row or column operations. See, for
instance, Anton, "Elementary Linear Algebra," John Wiley & Sons, Section 1.7, 1973,
the disclosure of which is incorporated herein by reference. The following explanation
assumes that row operations are performed. The explanation could easily be modified so
that column operations are performed. As explained in Anton, recall that to perform the
matrix inversion, it suffices to start with a subject matrix, in this case $(I - M)$, and
perform a series of elementary row operations converting the subject matrix to I. When
this same series of operations is applied in the same order to an identity matrix I, the
result is the inverse of the original subject matrix. The difficulty with this method is that
for a general $N \times N$ matrix, it requires $O(N^3)$ arithmetic operations.

A method is now demonstrated, exploiting both the sparsity structure of
$(I - M)$ and the need for just a single element of its inverse, that allows a very substantial
reduction in computation, compared to the general method just cited. As an instructive
exercise, consider the inversion of a 4 x 4 matrix $(I - M)$, corresponding to some product
machine. By the discussion in the previous section, it is assumed that the nodes in the
product machine can be ordered so that the matrix M and hence also $(I - M)$ is lower
triangular, that is, all non-zero elements lie on or below the main diagonal.

Assume that such an ordering has been performed and denote the elements
of $(I - M)$ by $a_{ij}$, where the non-zero elements satisfy $1 \le j \le i \le N$. Assume also, with no
loss of generality, that the index 1 corresponds to the start state of the product machine,
and the index N corresponds to the final state of the product machine. This entails that the
desired element of $(I - M)^{-1}$ for these purposes is the row N, column 1 entry.

Now, how to apply a modification of the method of elementary row
operations is discussed, and how to obtain the desired simplification in computation is
demonstrated. First, write down an augmented matrix $((I - M) \mid I)$, consisting of $(I - M)$
and I written side-by-side as shown below.

$$\bigl( (I - M) \mid I \bigr) = \left( \begin{array}{cccc|cccc}
a_{11} & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
a_{21} & a_{22} & 0 & 0 & 0 & 1 & 0 & 0 \\
a_{31} & a_{32} & a_{33} & 0 & 0 & 0 & 1 & 0 \\
a_{41} & a_{42} & a_{43} & a_{44} & 0 & 0 & 0 & 1
\end{array} \right)$$

The element $a_{11}$ is converted to unity through multiplication of the first
row by $d_1 = 1/a_{11}$, which is an elementary row operation. If a similar operation is
performed for each row of the matrix, consisting of multiplication of row i by $d_i = 1/a_{ii}$ in
each instance, the following is obtained:

$$\bigl( (I - M) \mid I \bigr) \sim \left( \begin{array}{cccc|cccc}
1 & 0 & 0 & 0 & r_1 := d_1 & \cdot & \cdot & \cdot \\
b_{21} & 1 & 0 & 0 & 0 & \cdot & \cdot & \cdot \\
b_{31} & b_{32} & 1 & 0 & 0 & \cdot & \cdot & \cdot \\
b_{41} & b_{42} & b_{43} & 1 & 0 & \cdot & \cdot & \cdot
\end{array} \right)$$

Formally, $b_{ij} = a_{ij} / a_{ii}$ for $1 \le j < i \le N$, where $d_i = 1/a_{ii}$. Here the symbol $\sim$ is used to
denote similarity by means of a series of elementary row operations. The quantity $r_1$ is
defined as shown.

Note that a dot has been placed in certain positions of the augmented
matrix, replacing the 0s or 1s previously exhibited there. This is to indicate that
operations upon these positions need not be performed, as these positions have no effect
upon the outcome of the computation. This is because, by the earlier discussion, only the
row 4, column 1 entry of $(I - M)^{-1}$ is required. As will be seen, this quantity depends
only upon operations performed in the first column of the right submatrix of $((I - M) \mid I)$;
that is, the portion that lies to the right of the vertical bar in our notation. This
simplification, which is the elimination of the need to perform any elementary row
operations on the remaining $N - 1$ columns of the right submatrix, constitutes the first
major element of the present method. Since this eliminates $O(N^3)$ arithmetic operations,
compared to the general method for matrix inversion, it is a very significant
simplification.

Elementary row operations are performed to zero out the off-diagonal
elements of $(I - M)$ one column at a time, starting with the leftmost column and
proceeding through the rightmost. For example, operating now upon the leftmost
column, to clear the $b_{21}$ element, multiply the first row by $-b_{21}$ and add it to the second
row. Likewise to clear the $b_{31}$ element, multiply the first row by $-b_{31}$ and add it to the
third row, and so on through each succeeding row. After completing all operations
necessary to clear the first column, the following obtains:

$$\bigl( (I - M) \mid I \bigr) \sim \left( \begin{array}{cccc|c}
1 & 0 & 0 & 0 & r_1 := d_1 \\
0 & 1 & 0 & 0 & r_2 := -b_{21} r_1 \\
0 & b_{32} & 1 & 0 & -b_{31} r_1 \\
0 & b_{42} & b_{43} & 1 & -b_{41} r_1
\end{array} \right)$$

Having completed the operations to clear the first column, define the quantity $r_2$ as
shown. A similar sequence of operations is performed to clear the second column. This
yields

$$\bigl( (I - M) \mid I \bigr) \sim \left( \begin{array}{cccc|c}
1 & 0 & 0 & 0 & r_1 := d_1 \\
0 & 1 & 0 & 0 & r_2 := -b_{21} r_1 \\
0 & 0 & 1 & 0 & r_3 := -b_{32} r_2 - b_{31} r_1 \\
0 & 0 & b_{43} & 1 & -b_{42} r_2 - b_{41} r_1
\end{array} \right)$$

Finally the third column is cleared and the method ends with

$$\bigl( (I - M) \mid I \bigr) \sim \left( \begin{array}{cccc|c}
1 & 0 & 0 & 0 & r_1 := d_1 \\
0 & 1 & 0 & 0 & r_2 := -b_{21} r_1 \\
0 & 0 & 1 & 0 & r_3 := -b_{32} r_2 - b_{31} r_1 \\
0 & 0 & 0 & 1 & r_4 := -b_{43} r_3 - b_{42} r_2 - b_{41} r_1
\end{array} \right)$$

Note that the original subject matrix $(I - M)$, on the left-hand side of the augmented
matrix, has been reduced to the identity, and hence its inverse (or rather the first column
of the inverse, since operations on the other columns were intentionally not performed)
has been developed in the right half of the augmented matrix. Thus $r_4$ is the desired
element of $(I - M)^{-1}$. Moreover, by comparison of the expressions for $r_1$ through $r_4$ with
the original matrix elements, it can be seen that:

$$r_1 = \frac{1}{a_{11}}, \qquad
r_2 = -\frac{a_{21} r_1}{a_{22}}, \qquad
r_3 = -\frac{a_{32} r_2 + a_{31} r_1}{a_{33}},$$

and

$$r_4 = -\frac{a_{43} r_3 + a_{42} r_2 + a_{41} r_1}{a_{44}}.$$

It is apparent from this series of expressions that the value $r_4$ depends only on
computations performed within its column of the right submatrix, utilizing either
elements of the original subject matrix $(I - M)$, or the values $r_1$ through $r_3$. This
constitutes the second part of the present method: there is no need to operate upon the
$(I - M)$ half of the augmented matrix. The only arithmetic results that are important are
the ones that are used in the determination of $r_4$, and these are precisely the expressions
for $r_1$ through $r_3$. This yields another savings of $O(N^3)$ operations, compared to a general
matrix inversion.



Consider now the effect upon the steps of the preceding description if
some matrix entry a_{ij} (and hence also b_{ij}, after division by a_{ii}) in a particular
column j is already zero, due to the sparsity structure of (I-M). It is clear that in such a
case, one need not perform any operation on the particular row i in question. (For after
all, the only reason for operating upon row i in that column is to reduce b_{ij} to 0, and
here one is in the salubrious position of finding this element already zeroed.) Since the
number of non-zero elements in any given column of (I-M) is bounded above by 4 (in the
general case by D_w × D_x), this means that no more than 4 (in general, D_w × D_x)
elementary row operations need be performed in any given column. This is another
significant savings, since it reduces O(N) arithmetic operations per column to a constant
O(1) operations, depending only upon the topologies of the synthesizer and valuation
machines.

The algorithm can now be stated in complete detail. For an N×N matrix,
the recurrence formulae are:

$$
r_1 = \frac{1}{a_{11}} \qquad (80A)
$$

and

$$
r_i = -\frac{\left\{\sum_{k=1}^{i-1} a_{ik} r_k\right\}}{a_{ii}}, \qquad (80B)
$$

for i = 2, ..., N, where the curly brace notation means that the sum extends only over
indices k such that a_{ik} ≠ 0. The element sought is r_N, and so by the recurrence
equations displayed above it suffices to compute r_i for i = 1, ..., N.
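
The recurrence can be sketched in a few lines of Python. This is a minimal illustration
rather than the implementation contemplated here: it assumes (I-M) is lower triangular
with a non-zero diagonal, as the elimination above requires, and the sparse layout (one
dict per row, mapping column index to a non-zero entry) and the name
`first_column_of_inverse` are hypothetical choices made for the example. Indices are
0-based, so `r[0]` plays the role of r_1.

```python
def first_column_of_inverse(rows):
    """Return r, the first column of A^{-1}, for a sparse lower-triangular A.

    rows[i] maps column index k -> non-zero entry a_ik (0-based) and must
    contain the diagonal entry rows[i][i]. This storage layout is an
    assumption of the sketch, not something prescribed by the text.
    """
    n = len(rows)
    r = [0.0] * n
    r[0] = 1.0 / rows[0][0]                      # r_1 = 1 / a_11
    for i in range(1, n):
        # Sum only over the stored (non-zero) entries left of the diagonal.
        numerator = sum(a * r[k] for k, a in rows[i].items() if k < i)
        r[i] = -numerator / rows[i][i]           # r_i = -{sum a_ik r_k} / a_ii
    return r


# Toy 4x4 matrix (illustrative values only, not taken from the figures).
A = [
    {0: 0.5},
    {0: -0.25, 1: 0.5},
    {1: -0.25, 2: 0.5},
    {0: -0.125, 2: -0.25, 3: 1.0},
]
r = first_column_of_inverse(A)
# r[-1] corresponds to the sought element r_N.
```

Each row costs one multiply-add per stored off-diagonal entry plus one division, so the
work grows with the number of non-zero entries rather than with the square of N.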

(9)(a)(3) Comparison with Prior State of the Art 

In this section, it is shown that the method detailed above constitutes a
significant advance in the state of the art. It should be noted that Equation 43 above by
itself constitutes an advance, since it mathematically defines the notion of confusability,
with respect to a pair of hidden Markov models. The assertions of novelty and
significance here pertain to the method just outlined for efficient computation of the
desired element of (I-M)^{-1}.

Let us return to the general recurrence formulae (80A) and (80B) above.
We now determine the number of arithmetic operations (additions, multiplications or
divisions) entailed by these formulae. Consider the expression for the numerator of r_i,
for i ≥ 2, which is {Σ_{k=1}^{i-1} a_{ik} r_k}. Recall that the curly braces mean that the
sum proceeds over non-zero entries a_{ik} of the probability flow matrix M. (Note that the
numerator of r_1 is just 1, a constant that requires no computation.)

Now as previously demonstrated, the number of non-zero entries in M is
bounded above by D_w × D_x × N, where N is the number of states of the machine, and also
the number of rows (and columns) of M. Thus the total number of multiplications and
additions required to compute all numerators of all the r_i values, for all i = 1, ..., N,
is no more than D_w × D_x × N. To this we also add a total of N divisions, since to obtain
each final r_i value, the numerator must be divided by a_{ii} in each case. Hence the total
number of arithmetic operations of all kinds (that is, addition, multiplication or
division) required to compute the desired value r_N is no more than
D_w × D_x × N + N = (D_w × D_x + 1) · N, which is an O(N) expression. This compares with
O(N^3) operations for a general matrix inversion, and therefore represents a very
substantial reduction in the number of arithmetic operations that must be performed.

For example, recall that N is determined as the product of the number of
states in the synthesizer model and the number of states in the valuation model. Since a
typical word will contain 5 phonemes, and each phoneme is modeled by a 3-state HMM,
this means that N typically attains values of about (5 · 3)^2 = 225. Thus we are reducing a
computation that requires on the order of 225^3 = 11,390,625 arithmetic operations to
(4 + 1) · 225 = 1,125 operations.
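
The arithmetic in this example can be checked directly. The variable names below are
illustrative only; the figure of 4 non-zero entries is the bound quoted above for this
topology.

```python
phonemes_per_word = 5
states_per_phoneme = 3
states_per_model = phonemes_per_word * states_per_phoneme   # 15 states per word model

n_states = states_per_model ** 2        # product machine: 15 * 15 states
dense_ops = n_states ** 3               # order of a general matrix inversion
sparse_ops = (4 + 1) * n_states         # (D_w x D_x + 1) * N with D_w x D_x = 4

print(n_states, dense_ops, sparse_ops)  # 225 11390625 1125
```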

(9)(b) Computational Caching

If one wishes to compute ξ_{w|x} for all w ∈ V, which will typically be the
case, there is a computational saving in reusing computations for similar words. Consider
two words w_1 and w_2, and suppose their pronunciations are identical through some initial
prefix of phonemes. (Here we are supposing, as above, that each word has a single
pronunciation.) If the synthesizer machines H_{w_1} and H_{w_2} are identical for states
i = 1, ..., m, then the values of a_{ij} appearing in (I-M) for rows i = 1, ..., m and all
columns will be identical. Hence the values r_1, r_2, ..., r_m will be identical, and once
they have been computed they can be stored and reused. This gives an improvement for
long words that have pronunciations that begin in the same way, such as "similar" and
"similarity." In this case, the full computation for "similar" can be reused for "similarity"
(assuming that right context is being ignored in determining HMM densities).

An example of this is shown in FIGS. 8 and 9. Referring now to FIG. 8,
this figure shows a synthesizer model 810, an evaluation model 820 and a product
machine 830. Synthesizer model 810 comprises synthesizer model 410 (from FIGS. 3 and
4) and additional state w_4. The evaluation model 420 is the evaluation model shown in
FIGS. 3 and 4. Consequently, product machine 830 contains product machine 430 (shown
in FIGS. 3 and 4) and an additional column of states. State w_4 of the synthesizer model
810 causes product machine states w_4x_1, w_4x_2, and w_4x_3 and also causes the appropriate
transitions between the product machine states.

Turning now to FIG. 9, this figure shows a probability flow matrix 1000
that is populated using the product machine 830 of FIG. 8. Also shown in FIG. 9 is a
column 1030 that corresponds to the leftmost column of ((I-M) | I). Probability flow
matrix 1000 contains probability flow matrix 600, which was shown in FIG. 6. Additionally,
the new state w_4 of the synthesizer model 810 of FIG. 8 causes entries 1001 through 1010
to be populated with probabilities. Determination of these types of probabilities has been
previously discussed in reference to FIG. 6. From FIG. 9 and the previous discussion on
Computational Caching, it can be seen that r_1 through r_9 will already be calculated when
probability flow matrix 600 is used to determine acoustic confusability for synthesizer
model 410 and evaluation model 420. Therefore, these may be held and reused when
determining acoustic confusability from probability flow matrix 1000, which derives
from synthesizer model 810 and evaluation model 420. This is a tremendous time
savings, as r_{10} through r_{12} are the only values that need to be determined when
probability flow matrix 1000 is used to determine acoustic confusability. For instance, it
could be that synthesizer model 410 is the synthesizer model for "similar" and synthesizer
model 810 is the synthesizer model for "similarity." The results r_1 through r_9 may be held
and reused during the probability flow matrix calculations for "similarity." Likewise, the
synthesizer model 810 could be the synthesizer model for "similar." The results for
"similar" could be reused when computing acoustic confusability for "similarity." Note
that the ordering of the states of the models will affect whether caching can be used for
prefixes, suffixes or both.
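
One way to picture the reuse is to let the recurrence resume from a stored prefix of r
values. The sketch below is illustrative only: the function name `extend_r`, the
dict-per-row matrix layout, and the toy numbers are assumptions, and it presumes a state
ordering that places the shared states first so that the leading rows of (I-M) coincide
for the two words.

```python
def extend_r(rows, r_prefix=()):
    """Continue the recurrence r_i = -{sum_k a_ik r_k} / a_ii.

    rows[i] maps column k -> non-zero a_ik (0-based, diagonal included).
    r_prefix holds values already computed for the first len(r_prefix) rows;
    they are reused unchanged and only the remaining rows are processed.
    """
    r = list(r_prefix)
    for i in range(len(r), len(rows)):
        if i == 0:
            r.append(1.0 / rows[0][0])
            continue
        numerator = sum(a * r[k] for k, a in rows[i].items() if k < i)
        r.append(-numerator / rows[i][i])
    return r


# Rows shared by a shorter word and a longer word with the same beginning.
shared_rows = [{0: 0.5}, {0: -0.25, 1: 0.5}, {1: -0.25, 2: 0.5}]
# Extra rows contributed by the additional synthesizer state(s).
extra_rows = [{0: -0.125, 2: -0.25, 3: 1.0}, {3: -0.5, 4: 0.5}]

cached = extend_r(shared_rows)                        # computed once, then stored
longer = extend_r(shared_rows + extra_rows, cached)   # only the new rows are processed
```

Resuming from `cached` gives the same values as recomputing the longer matrix from
scratch, while touching only the rows that the additional states introduce.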

(9)(c) Thresholding

The idea of thresholding is that if one is not interested in computing ξ_{w|x}
for words w and x that are highly distinguishable, then it is not necessary to complete the
computation of ξ_{w|x} when it is certain that it will be sufficiently small. In general,
ξ_{x|x} is used as a template for comparison, and ξ_{w|x} is thrown out if it is less than
ε ξ_{x|x} for some user-specified ε. The implicit assumption here is that ξ_{x|x} is likely
to be large, compared to ξ_{w|x} for some arbitrary word w that is acoustically dissimilar
to x. For this to be of any use, there needs to be a way of rapidly estimating an upper
bound for ξ_{w|x} and stopping the computation if this upper bound lies below ε ξ_{x|x}.

To do this, first note that the value of any given r_i in the
recurrence equations (80A) and (80B) above may be written as a sum of products
between fractions a_{ik}/a_{ii} and previously computed r_k values. Thus we have the bound

$$
|r_i| \le (i-1)\left(\max_{k<i}\left|\frac{a_{ik}}{a_{ii}}\right|\right)\left(\max_{\{k<i\}} r_k\right),
$$

where the curly braces denote that the second max need run only over those k values for
which a_{ik} ≠ 0.
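
The one-step bound can be checked numerically. The sketch below is an illustration under
stated assumptions: it reads the bound as |r_i| ≤ (i-1) · max |a_ik/a_ii| · max r_k, with
the division by a_ii taken from the description of r_i as a sum of products of the
fractions a_ik/a_ii, and the dict-per-row layout and the name `step_bound` are
hypothetical. How successive bounds would be chained into a stopping rule is not spelled
out in the text and is not attempted here.

```python
def step_bound(i, row, r):
    """Upper bound on |r_i| (0-based i >= 1) from the row's non-zero entries.

    row maps column k -> non-zero a_ik and includes the diagonal a_ii;
    r holds the previously computed values r_0 .. r_{i-1}.
    """
    ks = [k for k in row if k < i]
    if not ks:
        return 0.0                     # no off-diagonal entries: r_i is zero
    max_ratio = max(abs(row[k] / row[i]) for k in ks)
    max_prev = max(abs(r[k]) for k in ks)
    return i * max_ratio * max_prev    # 0-based i equals (i - 1) in 1-based terms


# Toy matrix (illustrative values only).
rows = [
    {0: 0.5},
    {0: -0.25, 1: 0.5},
    {1: -0.25, 2: 0.5},
    {0: -0.125, 2: -0.25, 3: 1.0},
]

r = [1.0 / rows[0][0]]
bounds = [None]                        # no bound is needed for the first value
for i in range(1, len(rows)):
    bounds.append(step_bound(i, rows[i], r))
    numerator = sum(a * r[k] for k, a in rows[i].items() if k < i)
    r.append(-numerator / rows[i][i])
```

On this toy matrix every computed value stays at or below its bound, with equality in the
first step, which is why the bound is written here with ≤ rather than a strict inequality.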

By using this bound, as the computation of ξ_{w|x} proceeds, it is possible to
determine at some intermediate point that ξ_{w|x} will never attain a value greater than
the threshold ε ξ_{x|x}. At this point the computation of ξ_{w|x} is abandoned, yielding
another substantial computational savings.

(10) ADDITIONAL APPLICATIONS

As previously discussed, acoustic perplexity and SAWER are beneficial
when determining the quality of language models, and therefore in adjusting their
parameters. However, there are additional applications of these quantities and methods
for determining them.



(10)(a) Vocabulary Selection

Consider a corpus C and a given recognizer vocabulary V. Suppose there
is a set of "unknown" words U that appear in C but are not present in V. It is desired to
determine which u ∈ U, if any, to add to V.

First note why this is a problem. Augmenting V with some particular u will
increase (from 0) the probability that this word will be decoded correctly when a system
encounters it. But it also increases the probability that the system will make errors on
other words, since there is now a new way to make a mistake.

Adding any given u ∈ U to V allows one to proceed to estimate the change
in error rate that follows from this addition. By the arguments given above, the SAWER
on C is



$$
S_{A_V}(C) = \frac{1}{|C|}\sum_{i=1}^{|C|}\bigl(1 - p_V(w_i \mid a(w_i)\, h_i)\bigr), \qquad (81)
$$



where p_V denotes computation of confusabilities with respect to the unaugmented
vocabulary V. Assume that p_V(w | a(w) h) = 0 when w ∉ V.

Suppose now that an augmented vocabulary V' = V ∪ {u} is formed. Then
recompute the synthetic acoustic word error rate as

$$
S_{A_{V'}}(C) = \frac{1}{|C|}\sum_{i=1}^{|C|}\bigl(1 - p_{V'}(w_i \mid a(w_i)\, h_i)\bigr). \qquad (82)
$$

It is hoped that S_{A_{V'}}(C) < S_{A_V}(C); in other words, that adjoining u causes the
error rate to drop. Thus, define Δ_u, the improvement due to u, as



$$
\Delta_u(C) = S_{A_V}(C) - S_{A_{V'}}(C)
  = \frac{1}{|C|}\sum_{i=1}^{|C|}\bigl(p_{V'}(w_i \mid a(w_i)\, h_i) - p_V(w_i \mid a(w_i)\, h_i)\bigr). \qquad (83)
$$

Then, vocabulary selection is performed by ranking the elements of U according to Δ_u,
adjoining the element with the largest strictly positive improvement, and then repeating
the computation.
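
The ranking-and-adjoining loop can be sketched as follows. This is a minimal
illustration, not a definitive implementation: `decode_prob(word, history, vocab)` is a
hypothetical callback standing in for the decoding probability p_V(w | a(w) h), the
corpus is represented as (word, history) pairs, and `toy_decode_prob` is an invented
stand-in model used only so the example runs.

```python
def sawer(corpus, vocab, decode_prob):
    """Average of 1 - p_V(w_i | a(w_i) h_i) over the corpus tokens."""
    return sum(1.0 - decode_prob(w, h, vocab) for w, h in corpus) / len(corpus)


def select_vocabulary(corpus, vocab, unknown, decode_prob):
    """Greedily adjoin the unknown word with the largest strictly positive gain."""
    vocab, unknown, added = set(vocab), set(unknown), []
    while unknown:
        base = sawer(corpus, vocab, decode_prob)
        gains = {u: base - sawer(corpus, vocab | {u}, decode_prob) for u in unknown}
        best = max(sorted(gains), key=gains.get)   # sorted() makes ties deterministic
        if gains[best] <= 0.0:
            break                                  # no remaining word helps
        vocab.add(best)
        unknown.remove(best)
        added.append(best)
    return added


def toy_decode_prob(word, history, vocab):
    """Stand-in model: words sharing a first letter are equally confusable."""
    if word not in vocab:
        return 0.0                                 # p_V = 0 for words outside V
    rivals = [v for v in vocab if v[0] == word[0]]
    return 1.0 / len(rivals)


corpus = [(w, ()) for w in ["cat", "cat", "cab", "emu", "dog", "emu"]]
chosen = select_vocabulary(corpus, {"cat", "dog"}, {"cab", "emu"}, toy_decode_prob)
```

In this toy setting "emu" is adjoined, while "cab" is rejected because the errors it
introduces on "cat" outweigh the tokens of "cab" that it recovers, which is the trade-off
described above.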



(10)(b) Selection of Trigrams and Maxent Features

The present invention is also useful when selecting features for maximum
entropy models, based upon the acoustic perplexity or SAWER, or their analogs for a
general channel (such as translation). The present invention is additionally useful for
selecting trigrams (or higher order ngrams), for extending lower-order ngram models,
based upon the gain in acoustic perplexity or synthetic acoustic word error rate.

That is, in a similar way to selection of words for vocabularies, one may
ask what trigrams should be used to augment a base bigram language model. This
question may be analyzed in terms of the effect this augmentation would have on both the
acoustic perplexity and synthetic acoustic word error rate.

Consider two language models: a base model p(w | h) and an augmented
model p_{xyz}(w | h). Here the latter is obtained as a maximum entropy model, perturbing
the base model according to

$$
p_{xyz}(w \mid h) = \frac{p(w \mid h)\, e^{\lambda_{xyz} C(w,h)}}{Z(h, \lambda_{xyz})}. \qquad (84)
$$

The exponent λ_{xyz} is determined in the usual maximum entropy manner by the
requirement



$$
E_{\tilde{p}}[C] = E_{p_{xyz}}[C], \qquad (85)
$$

where p̃ is an empirical model of the corpus.

The most valuable trigrams to use to augment the base model need to be
determined. These may be computed by using the decoding probabilities determined with
respect to these two different language models, respectively p(w_i | a(w_i) h_i) and
p_{xyz}(w_i | a(w_i) h_i). Define the gain, which measures value according to acoustic
perplexity, via



$$
G_{xyz} = \frac{1}{|C|}\log\frac{P_{xyz}(C \mid A)}{P(C \mid A)}. \qquad (86)
$$



Likewise define the improvement, which measures value according to synthetic acoustic
word error rate, via

$$
\Delta_{xyz} = \frac{1}{|C|}\bigl(P_{xyz}(C \mid A) - P(C \mid A)\bigr). \qquad (87)
$$

Both expressions are valid, and experimental methods can be used to determine which
measure is appropriate to a particular task.

(11) ADDITIONAL CONFUSABILITY CALCULATION

Presented here are simpler confusability calculations than that presented
above. These simpler confusability calculations use edit distances. An edit distance is the
number of phones that must be substituted, added, or removed to change a starting word
into an ending word. For example, the word "the" comprises the two phones "TH UH" as
one of its lexemes. The word "this" comprises the three phones "TH IX S." The edit
distance between "the" and "this" is two: one substitution of "IX" for "UH" and an
insertion of "S." This is more easily seen in the following manner. Starting with the
phoneme stream for "the," which is "TH UH," substitute "IX" for "UH." The phoneme
stream is now "TH IX." Insert an "S" phoneme onto the end of this, and "TH IX S"
results, which is the phoneme stream for "this." Thus, the edit distance between "the" and
"this" is two, with one insertion, one substitution, and no deletions.

Another technique for calculating edit distances when determining
confusability is to weight the operations performed when changing a starting word into an
ending word. For instance, deletions might be given a weight of three, substitutions a
weight of one, and insertions a weight of two. Then, in the previous example, the
modified edit distance would be 3, which is one substitution multiplied by the
substitution weight of one, plus one insertion multiplied by the insertion weight of two.

Another technique to calculate edit distances is to assign a cost to each individual
operation when converting a starting word to an ending word. For instance, in the
previous example, the cost of substituting IX in place of UH might be 0.5 while the cost
of inserting S might be 2.7. The edit distance is therefore one substitution at a substitution
cost of 0.5, plus one insertion at an insertion cost of 2.7, for a total edit distance of 3.2.

These weighted edit distances may then be used as the ξ(l(w) | l(x) h) in
Equation 45.
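
The three variants above can be reproduced with one dynamic-programming routine. The
sketch below is illustrative: it computes the minimum total cost over all edit sequences,
which coincides with the counts given for the "the"/"this" example; the function name
`edit_distance`, the callback parameters, and the large fallback cost of 10.0 for
operations the text does not price are assumptions made for the example.

```python
def edit_distance(src, dst, sub_cost=None, ins_cost=None, del_cost=None):
    """Minimum total cost of turning phone sequence src into dst."""
    sub_cost = sub_cost or (lambda a, b: 1.0)
    ins_cost = ins_cost or (lambda b: 1.0)
    del_cost = del_cost or (lambda a: 1.0)
    m, n = len(src), len(dst)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost(src[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost(dst[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = src[i - 1], dst[j - 1]
            d[i][j] = min(
                d[i - 1][j - 1] + (0.0 if a == b else sub_cost(a, b)),  # match/substitute
                d[i - 1][j] + del_cost(a),                              # delete a
                d[i][j - 1] + ins_cost(b),                              # insert b
            )
    return d[m][n]


the, this = ["TH", "UH"], ["TH", "IX", "S"]

# Unweighted: every operation counts as one.
plain = edit_distance(the, this)

# Weighted by operation type: substitution 1, insertion 2, deletion 3.
weighted = edit_distance(the, this,
                         sub_cost=lambda a, b: 1.0,
                         ins_cost=lambda b: 2.0,
                         del_cost=lambda a: 3.0)

# Per-operation costs; anything not listed falls back to 10.0 (an assumption).
SUB = {("UH", "IX"): 0.5}
INS = {"S": 2.7}
per_op = edit_distance(the, this,
                       sub_cost=lambda a, b: SUB.get((a, b), 10.0),
                       ins_cost=lambda b: INS.get(b, 10.0),
                       del_cost=lambda a: 10.0)
```

Here `plain` is 2, `weighted` is 3, and `per_op` is 3.2 up to floating-point rounding,
matching the three worked figures in the text.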


(12) EXEMPLARY SYSTEM 

Turning now to FIG. 10, a block diagram of a system 1000 for determining
and using confusability, acoustic perplexity and SAWER is shown. System 1000
comprises a computer system 1010 and a Compact Disk (CD) 1050. Computer system
1010 comprises a processor 1020, a memory 1030 and an optional display 1040.

As is known in the art, the methods and apparatus discussed herein may be
distributed as an article of manufacture that itself comprises a computer-readable medium
having computer-readable code means embodied thereon. The computer-readable
program code means is operable, in conjunction with a computer system such as
computer system 1010, to carry out all or some of the steps to perform the methods or
create the apparatuses discussed herein. The computer-readable medium may be a
recordable medium (e.g., floppy disks, hard drives, compact disks, or memory cards) or
may be a transmission medium (e.g., a network comprising fiber-optics, the world-wide
web, cables, or a wireless channel using time-division multiple access, code-division
multiple access, or other radio-frequency channel). Any medium known or developed
that can store information suitable for use with a computer system may be used. The
computer-readable code means is any mechanism for allowing a computer to read
instructions and data, such as magnetic variations on a magnetic medium or height
variations on the surface of a compact disk, such as compact disk 1050.

Memory 1030 configures the processor 1020 to implement the methods,
steps, and functions disclosed herein. The memory 1030 could be distributed or local and
the processor 1020 could be distributed or singular. The memory 1030 could be
implemented as an electrical, magnetic or optical memory, or any combination of these or
other types of storage devices. Moreover, the term "memory" should be construed broadly
enough to encompass any information able to be read from or written to an address in the
addressable space accessed by processor 1020. With this definition, information on a
network is still within memory 1030 because the processor 1020 can retrieve the
information from the network. It should be noted that each distributed processor that
makes up processor 1020 generally contains its own addressable memory space. It should
also be noted that some or all of computer system 1010 can be incorporated into an
application-specific or general-use integrated circuit.

Optional display 1040 is any type of display suitable for interacting with a
human user of system 1000. Generally, display 1040 is a computer monitor or other
similar video display.

It is to be understood that the embodiments and variations shown and
described herein are merely illustrative of the principles of this invention and that various
modifications may be implemented by those skilled in the art without departing from the
scope and spirit of the invention.